Unified portal for regulatory and splicing elements for genome analysis

ABSTRACT

A method, including identifying, in a nucleotide string, at least two exons, at least one acceptor, at least one donor, and at least one intron between the at least two exons, is provided. The method includes identifying, in the nucleotide string, a cryptic splice site comprising a sequence of nucleotides based on a similarity score with at least one of the acceptor or the donor, and graphically marking, in a display for a user, the nucleotide string at a location indicative of an exon, an intron, a true splice site, and optionally a cryptic splice site when the similarity score is higher than a pre-selected threshold. A system and a non-transitory, computer-readable medium including instructions to cause the system to perform the method are also provided.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Application No. PCT/US2021/047025, filed Aug. 20, 2021, entitled “A UNIFIED PORTAL FOR REGULATORY AND SPLICING ELEMENTS FOR GENOME ANALYSIS”, which claims priority under Article 8 of the PCT to U.S. Provisional Application No. 63/166,803, entitled “A UNIFIED PORTAL FOR REGULATORY AND SPLICING ELEMENTS FOR GENOME ANALYSIS,” to Sudar Senapathy, filed on Mar. 26, 2021, and to U.S. Provisional Application No. 63/166,829, entitled “A PRECISION MEDICINE PORTAL FOR HUMAN DISEASES,” to Periannan Senapathy, filed on Mar. 26, 2021, the contents of both applications incorporated herein by reference in their entirety, for all purposes.

This application is related to International Application No. PCT/US2021/047027 filed Aug. 20, 2021, which claims the benefit of U.S. Provisional Patent Application No. 63/166,803, filed Mar. 26, 2021, and U.S. Provisional Patent Application No. 63/166,829, filed on Mar. 26, 2021, each of which is owned by Applicant and is incorporated herein by reference and which is not admitted to be prior art with respect to the present invention by its mention in the cross-reference section.

BACKGROUND

Field

The present disclosure relates generally to a platform of networked computing devices for performing a comprehensive analysis of gene regulation within the human genome. More specifically, the present disclosure provides a map of genes and mutations thereof for an individual or a cohort of individuals, and their functionality and phenotypic manifestations for use in disease diagnostics and therapeutics and in the analysis of other inherited traits in the human genome.

Related Art

In the field of genomic analysis, much relevance is given to protein-encoding portions of the genome. However, little is known as to other portions of the genome that may not encode proteins, but may be linked to disease and other phenotypic traits yet to be discovered. Moreover, there is a lack of a systematic approach to search, classify, identify, and illustrate coding and non-coding portions of the genome and associated mutations.

SUMMARY

In a first embodiment, a computer-implemented method includes identifying, in a nucleotide string, at least two exons, at least one acceptor, at least one donor, and at least one intron between the at least two exons, identifying, in the nucleotide string, a cryptic splice site including a sequence of nucleotides based on a similarity score with at least one of the acceptor or the donor, and graphically marking, in a display for a user, the nucleotide string at a location indicative of an exon, an intron, a true splice site, and optionally a cryptic splice site when the similarity score is higher than a pre-selected threshold.

In a second embodiment, a computer-implemented method includes identifying a first amino acid string corresponding to a functional protein or protein domain, aligning said first amino acid string with at least one additional amino acid string that encodes a functional variant of said functional protein, identifying, at each amino acid position within said additional amino acid string, multiple variable amino acids that appear in the at least one additional amino acid string for each aligned location in the first amino acid string, and graphically marking, in a display for a user, a variable amino acid as an allowable amino acid at an aligned location in said first amino acid string.

In a third embodiment, a computer-implemented method includes identifying, in a nucleotide string, at least two exons, and at least one intron between the at least two exons, and a promoter sequence, selecting, within the nucleotide string, a cryptic promoter site including a sequence of nucleotides resembling the promoter sequence, associating a score to the cryptic promoter site based on a similarity score between the cryptic promoter site and the promoter sequence, and graphically marking, in a display for a user, the nucleotide string at a location indicative of the cryptic promoter site when the score is higher than a pre-selected threshold.

In a fourth embodiment, a computer-implemented method includes identifying, in a nucleotide string, a poly-A addition site, wherein the poly-A addition site includes a poly-A site and a signal, selecting, within the nucleotide string, a cryptic poly-A site, the cryptic poly-A site including a sequence of nucleotides resembling at least one of the poly-A sites, associating a similarity score to the cryptic poly-A site based on a similarity between the cryptic poly-A site and a real poly-A site, and graphically marking, in a display for a user, the nucleotide string at a location indicative of the cryptic poly-A site when the similarity score is higher than a pre-selected threshold.

In yet another embodiment, a computer-implemented method includes identifying a first nucleotide string corresponding to a non-coding RNA gene. The computer-implemented method also includes aligning said first nucleotide string with at least one additional nucleotide string that specifies a functional variant of said non-coding RNA gene, and identifying, at each nucleotide position within said additional nucleotide string, multiple variable nucleotides that appear in the at least one additional nucleotide string for each aligned location in the first nucleotide string. The computer-implemented method includes graphically marking, in a display for a user, a variable nucleotide as an allowable nucleotide at an aligned location in said first nucleotide string.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an architecture of devices and systems for providing a personalized product service, according to some embodiments.

FIG. 2 illustrates the details for devices and systems in the architecture of FIG. 1, according to some embodiments.

FIGS. 3A-3F illustrate details of exon splices, according to embodiments disclosed herein.

FIGS. 4A-4C illustrate details of cryptic splices, according to embodiments disclosed herein.

FIGS. 5A-5C illustrate an exon chart, according to embodiments disclosed herein.

FIGS. 6A-6D illustrate exemplary embodiments of alternative splices, as disclosed herein.

FIGS. 7A-7E illustrate exemplary embodiments of an exon frame, as disclosed herein.

FIGS. 8A-8D illustrate exemplary embodiments of a protein signature, according to embodiments disclosed herein.

FIGS. 9A-9F illustrate exemplary embodiments of an un-translated portion of a genome, according to embodiments disclosed herein.

FIGS. 10A-10B illustrate exemplary embodiments of a branch point in a genome, according to embodiments disclosed herein.

FIGS. 11A-11B illustrate exemplary embodiments of a non-coding RNA map, according to embodiments disclosed herein.

FIG. 12 illustrates a process for finding a variable and a non-variable sequence signature of a protein, according to some embodiments.

FIG. 13 is a flowchart illustrating steps in a method for identifying and displaying a cryptic site in a nucleotide string, according to some embodiments.

FIG. 14 is a flowchart illustrating steps in a method for creating and displaying a protein signature in an amino acid string, according to some embodiments.

FIG. 15 is a flowchart illustrating steps in a method for identifying and displaying a cryptic promoter site in a nucleotide string, according to some embodiments.

FIG. 16 is a flowchart illustrating steps in a method for identifying and displaying a cryptic poly-A site in a nucleotide string, according to some embodiments.

FIG. 17 is a block diagram illustrating an example computer system with which the client and server of FIGS. 1 and 2 and the methods of FIGS. 13-16 can be implemented.

In the figures, elements or steps having the same or similar labels are associated with features or processes having the same or similar description, unless otherwise stated.

DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth to provide a full understanding of the present disclosure. It will be apparent, however, to one ordinarily skilled in the art, that the embodiments of the present disclosure may be practiced without some of these specific details. In other instances, well-known structures and techniques have not been shown in detail so as not to obscure the disclosure.

The present disclosure is directed to a platform for the comprehensive analysis of gene regulation and splicing in genes within the human genome. The platform provides a basis for the analysis of regulatory and splicing elements in the human genome, their cryptic versions, and extensive details of the molecular processes that occur at every level of gene expression, transcription, splicing, and translation of every gene in the human genome. In some embodiments, the platform disclosed herein facilitates the analysis of mutations and aberrations in these processes at the structural, molecular, and sequence levels, and their associations with various diseases. A platform as disclosed herein enables the analysis of sequence elements and protein factors that assist in tissue-specific gene expression and alternative splicing of transcripts, thus enabling the understanding of basic biological processes, and the mutations that cause various tissue- and organ-specific cancers and diseases. Further, the platform finds potential additional genes in the unexplored region of the genome (e.g., the dark matter genome), and focuses on the analysis of the regulatory and splicing elements and their cryptic versions in these genes, and the mutations that occur in various diseases thereof. A platform as disclosed herein thus enables the thorough analysis of the processes of gene regulation and splicing, and the aberrations within them that cause diseases from genes in the human genome. This platform is useful to biologists, practicing clinicians, and clinical researchers to study and understand gene regulation and splicing and their aberrations.

Biologists have traditionally expected that most disease-causing mutations occur in protein-coding regions (CDS), as they directly affect the proteins. Thus, regulatory elements have been largely ignored, and consolidated tools to address these regions have been lacking. However, it is increasingly becoming apparent that mutations in regulatory and splicing elements are responsible for upwards of 60% of all diseases. Embodiments as disclosed herein address this shortcoming of current genome analysis and provide a tool for a systematic analysis of the regulatory and splicing elements, and the effect of mutations in them. The platform provides the ability for a comprehensive analysis of gene regulation and splicing, and their aberrations due to mutations.

Human genes contain exons that are the protein-coding portions of the gene, and introns that do not code for the protein. The exons are sequence portions that are expressed into protein sequences, and the introns are the sequences that interrupt the exons and have regulatory sequences within them. Exons are usually short, with an average length of ~120 bases, whereas introns are usually very long, with an average length of ~6,000 bases. Human genes contain an average of ~10 exons, but a considerable fraction of genes consist of a large number of exons and introns, up to 200 exons. The gene is copied into an RNA transcript, from which the introns are excised and the exons are “glued” together to synthesize a functional protein.

The exons are spliced together to form the complete coding sequence, and the introns that interrupt the coding sequence are eliminated. A complex machinery called the spliceosome, which contains ~300 proteins and five small nuclear RNAs (snRNAs), carries out this splicing process. In addition to the coding sequence, there exist several regulatory elements including the promoter, transcription initiation site, splicing sites, branch point sites, enhancers and silencers of the promoter and splicing sites, poly-A addition sites, and un-translated regions upstream (5′ UTR) and downstream (3′ UTR) of the coding sequence.

Promoter sites regulate the expression of genes by binding the RNA polymerase enzyme to the promoter sequence(s) and initiating the transcription at the transcription start site. Several transcriptional regulatory proteins also bind to multiple regions of the promoter site and enable complex regulation of genes. Furthermore, elements known as enhancers and silencers of the promoter sites enhance or suppress gene expression, respectively. By utilizing different promoter enhancers and silencers in various tissues and organs, the expression of tissue- and organ-specific genes is regulated. Mutations in any of these regions can disrupt gene regulation, thus leading to disease. In addition, mutations in the transcription initiation sites also affect the gene expression and can lead to disease.

RNA splicing is carried out using sequence signals bordering the exon-intron junctions. A splice sequence at the 5′ end of the intron (e.g., the donor site), and another splice sequence at the 3′ end of the intron (e.g., the acceptor site), aid in this process. In addition, a signal known as the branch point sequence within the intron also assists in the process of splicing. Mutations in these regions lead to aberrations in the splicing process, and are known to cause numerous cancers and non-cancer diseases. Enhancers and silencers of these splice sites are present within the exonic and intronic regions, and enhance or suppress the splicing process, respectively. By utilizing different splicing enhancers and silencers in various tissues and organs, the spliceosome can produce alternative splicing transcripts in different organs. Mutations in these regions also lead to diseases.

The 5′ un-translated region (5′-UTR) of an mRNA plays a critical role in the regulation of translation. It contains functional elements that fine-tune the process of protein expression. Mutations in these regions are associated with a number of human diseases. The poly-A addition sites present in the 3′ UTR region contribute to mRNA stability, translation control, and nuclear export of the mRNA. There may exist multiple poly-A sites in many genes, thus producing more than one transcript from a single gene. Mutations in these sites lead to disruption of these processes, causing numerous diseases.
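As an illustration of how multiple candidate poly-A signals might be located in a 3′ UTR, the following minimal sketch scans a DNA string for the canonical hexamer AATAAA (the DNA form of AAUAAA). The function name `find_poly_a_signals` and the example sequence are hypothetical and are not taken from the disclosure; a practical implementation would likely also consider variant hexamers and surrounding sequence context.

```python
import re

# Canonical poly-A signal hexamer, written as DNA (AAUAAA in the mRNA).
POLY_A_SIGNAL = "AATAAA"


def find_poly_a_signals(seq, signal=POLY_A_SIGNAL):
    """Return the 0-based start of every occurrence of `signal` in `seq`.

    A lookahead pattern is used so that overlapping occurrences are reported.
    """
    pattern = "(?=" + re.escape(signal) + ")"
    return [m.start() for m in re.finditer(pattern, seq.upper())]


# Hypothetical 3' UTR fragment containing two signal hexamers.
utr3 = "GCTTAATAAAGCTGCCAATAAATTTG"
positions = find_poly_a_signals(utr3)
print(positions)
```

Each reported position marks a candidate signal; more than one hit in the same 3′ UTR is consistent with a gene that can produce more than one transcript.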

For every regulatory element, there exist sequences that resemble the genuine (real) sequences within the gene and are known as cryptic sites. When mutations occur in the real or cryptic sites, cryptic sites may be used instead of the real sites and may cause aberrations in the transcription, splicing, and translation processes. Thus, it is desirable to identify and analyze cryptic regulatory and splicing elements and cryptic exons. Cryptic regulatory elements and cryptic exons play a major role in disease causation. Analysis of these sites paves the way to identifying disease associations and to the diagnosis and treatment of several diseases.
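The idea of a cryptic site as a sequence that resembles a real site can be sketched as a sliding-window comparison. The example below is only an illustrative assumption: it scores each window by the fraction of bases identical to the annotated real site and keeps windows whose score is higher than a pre-selected threshold. The disclosure does not specify the scoring function, and a position weight matrix or another model could be used in its place. The names `similarity` and `find_cryptic_sites`, the example sequence, and the threshold value are all hypothetical.

```python
def similarity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def find_cryptic_sites(seq, real_start, site_len, threshold):
    """Return (position, score) for windows that resemble the real site.

    real_start: 0-based start of the annotated (real) site within seq.
    threshold:  assumed pre-selected cutoff in [0, 1]; a window is reported
                only when its score is strictly higher than this value.
    """
    real = seq[real_start:real_start + site_len]
    hits = []
    for i in range(len(seq) - site_len + 1):
        if i == real_start:
            continue  # skip the real site itself
        score = similarity(seq[i:i + site_len], real)
        if score > threshold:
            hits.append((i, score))
    return hits


# Hypothetical sequence: a real donor-like site CAGGTAAGT at index 2 and a
# near-copy CAGGTGAGT (one mismatch) at index 17.
seq = "TTCAGGTAAGTACCGGTCAGGTGAGTCATT"
hits = find_cryptic_sites(seq, real_start=2, site_len=9, threshold=0.8)
print(hits)
```

Raising or lowering the threshold trades sensitivity for specificity, which corresponds to the user-defined score thresholds mentioned for the Cryptic Splice element.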

There are thus multiple regulatory elements and their cryptic versions in every gene that are desirable to correctly transcribe, splice, and translate the gene to its protein. Mutations in any of these regions may disrupt each of these processes, leading to cancers and several other diseases. It is increasingly realized that errors in these processes contribute to disease. However, the field has traditionally focused on the coding regions of the genes for deciphering the genetic causes of diseases, largely ignoring the regulatory regions.

The coding regions of genes constitute only 2% of the human genome, and the other 98% are introns, which do not code for proteins. In addition, the known genes only include ~30% of the genome. The remaining ~70% of the genome are intergenic regions without any known genes. However, genes that are not yet discovered may occur in these regions, and mutations within them may cause diseases. In addition, genes can also occur within the long intron sequences in the known genes. A platform as disclosed herein identifies these genes in unexplored regions of the human genome (e.g., the dark matter genome), and applies its features on these genes to further study the regulatory and splicing processes and disease-causing mutations in them.

Embodiments of this disclosure may analyze all or multiple portions of the human genome including non-coding (nc) RNA genes (e.g., tRNA, rRNA, miRNA, snoRNA, snRNA, and lncRNA), and may further drill down into the sequence views of elements displayed on the gene sequence, depicting them in different color codes. Embodiments of a platform as disclosed herein also provide statistics analyzed for selected genes in various organisms other than human, including animals, microbes, plants, fungi, and viruses. A platform as disclosed herein may display the information on “before and after exon exclusion” with statistics of domains, total number of different consequences for all possible/predicted domains, and exon exclusion events.

Proteins are of paramount importance as the functional and structural units of a number of physiological processes. An aberration in the protein structure may lead to molecular diseases with profound alterations in biological and metabolic functions. A protein contains one or more domains, which are its basic units. Each domain carries out a specific biochemical or biological function, and using multiple domains together, a protein can accomplish complex biological functions. Although genes carry the biological information that constructs an organism, it is the proteins that are the true workhorses of the cells, tissues, and organs, and in fact, the whole organism. The sequence of a protein is not constant; rather, it is variable. At many positions in a protein sequence, called variable amino acid positions, more than one amino acid can be present without altering the structure and function of the protein. Only at some positions, called invariant or low variant amino acid positions, is the amino acid invariant or able to vary only among a limited number of amino acids.

Accordingly, high amino acid variability occurs frequently in a protein, as only a few amino acids may be desirable in specific locations to enable key functions such as the active site of the protein. The rest of the amino acids help bring about the three-dimensional structure of the protein in such a way that the active sites are correctly placed to carry out its function. These amino acids, other than the active sites, can be allowed to vary (e.g., be replaced by other amino acids), without altering the structure or function of a protein. Thus, protein sequences exhibit amino acid variability or degeneracy, which provides a definite set of variable amino acids at each position of the protein, forming a variable amino acid sequence signature. There exist a few invariable positions, which exhibit one to a few allowed amino acids. Mutations in these invariable or low variable positions will alter the active sites of the protein, leading to a defective protein unable to carry out its function, and thus to disease. Most of the mutations in the highly variable positions are tolerated and are said to be benign. In addition, an alternating rhythm of hydrophilic (water loving) and hydrophobic (water repelling) amino acids has been found to be largely sufficient to maintain the protein structure and function. Thus, the pattern of these hydrophobic and hydrophilic amino acids forms a major part of the study of protein structure and function, and the aberrations due to mutations. In addition, protein domains form secondary structures that have implications for protein stability, which are affected by mutations in disease. Accordingly, embodiments as disclosed herein provide a platform to visualize, analyze, query, and search for variability in protein structure and signature, and to correlate with it the effect (deleterious or not), in the phenotypic trait of a subject, of mutations or other genetic aberrations.
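One way to picture a variable amino acid sequence signature is as the set of residues observed in each column of an alignment of functional variants. The sketch below assumes the sequences are already aligned to equal length with `-` marking gaps. The function names, the `low_max` cutoff separating low-variant from variable positions, and the example alignment are illustrative assumptions rather than details from the disclosure.

```python
def protein_signature(aligned):
    """Return, for each aligned column, the sorted amino acids observed.

    `aligned` is a list of equal-length strings; gap characters ('-') are
    ignored when collecting the residues allowed at a position.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    return ["".join(sorted({s[i] for s in aligned} - {"-"}))
            for i in range(length)]


def classify_positions(signature, low_max=2):
    """Label each position by how many residues it allows.

    low_max is an assumed cutoff: one residue (or none) is 'invariant', up to
    low_max residues is 'low-variant', and anything above is 'variable'.
    """
    labels = []
    for allowed in signature:
        n = len(allowed)
        if n <= 1:
            labels.append("invariant")
        elif n <= low_max:
            labels.append("low-variant")
        else:
            labels.append("variable")
    return labels


# Hypothetical alignment of four functional variants of a short domain.
aligned = ["MKHLV", "MRHIV", "MKHMA", "MQH-V"]
signature = protein_signature(aligned)
labels = classify_positions(signature)
print(list(zip(signature, labels)))
```

Under this sketch, a mutation that introduces a residue outside the allowed set at an invariant or low-variant position would be flagged for closer review, while a substitution within the allowed set at a variable position would be treated as likely tolerated.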

Embodiments as disclosed herein provide a robust platform for a comprehensive and thorough analysis of the genetic and associated molecular processes of disease and other phenotypic traits. It further enables the analysis of mutations and aberrations in these processes at the structural, molecular, and sequence levels, and their associations with various diseases and disorders. Thus, a platform consistent with the present disclosure provides a basic foundation for the analysis of regulatory elements in the human genome. This foundation enables further analysis of sequence elements and protein factors that assist in tissue-specific gene expression and alternative splicing of transcripts, thus enabling the understanding of the basic biological processes and diseases.

Alternative splicing is a regulatory process occurring in eukaryotes, where it greatly increases the biodiversity of proteomes that can be encoded from the genome. During gene expression, exons are spliced alternatively in different isoforms, resulting in multiple proteins coded by a single gene. In this process, specific exons of a gene may be included or excluded from the processed messenger RNA (mRNA) produced from that gene. Consequently, the proteins translated from alternatively spliced mRNAs will contain differences in their amino acid sequence and, often, in their biological structure, functions, and clinical associations. The generally recognized modes of alternative splicing are: exon skipping (one of the most common modes), in which an exon may be spliced out of the primary transcript or retained; mutually exclusive exons, in which only one of two exons is retained in mRNAs after splicing; alternative donor site, in which an alternative donor other than that in the canonical transcript is used, changing the 3′ boundary of the upstream exon; alternative acceptor site, in which an alternative 3′ splice junction is used, changing the 5′ boundary of the downstream exon; and intron retention, in which a partial intron sequence may be retained. Furthermore, other mechanisms generate different mRNAs from a single gene, such as multiple promoters and multiple polyadenylation sites. Use of multiple promoters is properly described as a transcriptional regulation mechanism rather than alternative splicing; by starting transcription at different points, transcripts with different 5′-most exons can be generated. At the other end, multiple polyadenylation sites provide different 3′ end points for the transcript.

The production of alternatively spliced mRNAs is regulated by a system of trans-acting proteins that bind to cis-acting sites on the primary transcript, including splicing activators that enhance the usage of a particular splice site, and splicing repressors or silencers that reduce the usage of a particular site. Mechanisms of alternative splicing are highly variable, and various methods are used to elucidate and predict the regulatory systems involved in splicing by a “splicing code.” In addition, errors in alternative splicing due to mutations in the splice sites, cryptic sites, branch point sites, enhancers and silencers, and other regulatory elements can lead to aberrations in alternative splicing, resulting in truncated or defective protein. Splicing aberrations have a profound impact, contributing to a large proportion of genetic disorders, various cancers, and other diseases.

Embodiments as disclosed herein provide a computer-implemented platform to address both of these mechanisms with an alternative splicing module to query and analyze a variety of mRNAs that may be derived from a gene. In some embodiments, the alternative splicing module provides analysis of potentially deleterious effects of alternative splicing aberrations, and a correlation of these with disease and other phenotypic traits in a subject.

The current understanding of molecular biology is that the flow of information takes place from DNA, to RNA, to protein through various biological processes. The first step is the formation of the RNA transcript (copy) of the gene, which forms the bridge between DNA and the mature mRNA that is ready to be translated into proteins. Upstream of the gene sequence lies the promoter sequence, which forms the control region that switches the gene on or off. The RNA polymerase binds to this promoter sequence to make the RNA transcript. The introns in the RNA transcript are spliced out, thereby linking together the exons to make the mature mRNA. There exist regulatory sequences upstream and downstream of the mRNA region within the transcripts that are not translated into protein. These un-translated regions, known as the 5′ UTR (upstream) and 3′ UTR (downstream), contain regulatory elements that regulate the transport and translation of the mRNA into protein.

Accordingly, embodiments consistent with the present disclosure illustrate the properties of these promoters and un-translated regions (UTRs) in the gene transcripts and mRNA sequences, enabling further interactive analysis by the user. Some embodiments include classifying exons in the gene according to whether they are coding, partially-coding, or non-coding, and showing splice site sequences and their scores. The platform then locates any upstream and downstream open reading frames (u-ORFs and d-ORFs) that surround the real ORF of the mRNA. A Kozak consensus sequence is a motif that functions as the protein translation initiation site within the mRNA. A mutated, wrong start site can result in non-functional proteins and have implications in human diseases. In some embodiments, a platform as disclosed herein calculates a score for Kozak consensus sequences that exist upstream and downstream of the start codon ATG, which would indicate which of the u-ORFs and d-ORFs may be turned on in different biological contexts.
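A rough sketch of locating candidate ORFs and rating their start-codon context is given below. It is an assumption for illustration only: the disclosure does not state how the Kozak score is computed, so the sketch uses a widely cited heuristic in which a purine (A or G) at position -3 and a G at position +4 relative to the A of ATG indicate a favorable context. The function names and the example mRNA are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kozak_strength(mrna, atg):
    """Rate the context of the ATG starting at index `atg`.

    Heuristic (an assumption, not the disclosed scoring): a purine three
    bases upstream of the A and a G immediately after the ATG each count
    once, giving 'weak' (0), 'adequate' (1), or 'strong' (2).
    """
    minus3 = atg >= 3 and mrna[atg - 3] in "AG"
    plus4 = atg + 3 < len(mrna) and mrna[atg + 3] == "G"
    return ("weak", "adequate", "strong")[int(minus3) + int(plus4)]


def find_orfs(mrna):
    """Return (start, end) for every ATG followed by an in-frame stop codon.

    `end` is exclusive and includes the stop codon.
    """
    orfs = []
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] != "ATG":
            continue
        for i in range(start + 3, len(mrna) - 2, 3):
            if mrna[i:i + 3] in STOP_CODONS:
                orfs.append((start, i + 3))
                break
    return orfs


# Hypothetical mRNA fragment with two short ORFs.
mrna = "GGACCATGGCTTAACTTTCCATGTCAGCTTGA"
orfs = find_orfs(mrna)
ratings = [(start, end, kozak_strength(mrna, start)) for start, end in orfs]
print(ratings)
```

Given the annotated start of the real ORF, ORFs that begin before it could be labeled u-ORFs and those after it d-ORFs, with the context rating suggesting which are more likely to be used.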

A branch point sequence is a regulatory element that aids the spliceosome to form a loop with an intron before splicing an upstream exon with a downstream exon to form the mRNA. Embodiments as disclosed herein enable the analysis of branch point sequences and their cryptic versions, which plays an important role in understanding the molecular mechanisms of splicing and their disease associations.

Mutations within branch point sequences disrupt the lariat formation and result in aberrations in splicing. Incorrect splicing due to aberrations in branch point sequences is responsible for 9-10% of the genetic diseases that are caused by point mutations. Such aberrations lead to various effects in splicing, including exon skipping due to improper binding of the SF1 and U2 snRNP splicing proteins and disruption of the natural acceptor splicing site, or intron retention (of the whole intron or a fragment of it) if they create a new 3′ splice site. Mutations within a cryptic branch sequence may cause aberrations that can incorrectly splice the gene transcript and lead to various cancers and other diseases. Accordingly, a platform as disclosed herein identifies real and cryptic branch point sequences throughout the genes and possible mutation events thereof. Moreover, embodiments as disclosed herein may correlate the findings with publicly available disease and annotation databases.

A platform as disclosed herein may determine whether a mutation in the branch point sequence may cause splicing aberrations and the type and mechanism of such aberrations (such as exon skipping and intron inclusion). Thus, the platform can be ideal to discover novel branch point mutations from the individual subject's genome or genomes from a cohort of subjects. The platform's approach of predicting real and cryptic branch point sequences (BPS) within a gene or any sequence, and detecting the deleterious mutations within the branch points, acts as an effective strategy for clinicians and researchers in analyzing the splicing defects associated with disease. Also, branch point mutations establish a valuable resource for further investigations into the genetic encoding of splicing patterns and interpreting the impact of common and disease-causing human genetic variation in gene splicing.

Non-coding RNAs (ncRNAs) are functional molecules that are only transcribed and not translated into proteins. A large fraction of the human genome consists of non-coding elements such as small non-coding RNAs (miRNA, piRNA, siRNA, snRNA), and long non-coding RNAs (lincRNA, NAT, eRNA, circRNA, ceRNAs, PROMPTs). These ncRNAs mediate the regulation of gene expression, and play critical roles in defining DNA methylation patterns. The mis-regulation of lncRNAs is often associated with cancer and other diseases.

Transfer ribonucleic acid (tRNA) helps in decoding the messenger RNA (mRNA) into a protein. tRNAs function at specific sites in the ribosome during the translation process, synthesizing a protein from an mRNA molecule. tRNA also has introns 14-60 bases in length that interrupt the anticodon loop. tRNA splicing is a rare form of splicing that involves a different biochemistry than the spliceosomal and self-splicing pathways.

Ribosomal RNA (rRNA) associates with a set of proteins to form ribosomes. These complex structures, which physically move along an mRNA molecule, catalyze the assembly of amino acids into protein chains. They also bind tRNAs and various accessory molecules necessary for protein synthesis.

MicroRNAs (miRNAs) are key regulators of biological processes in animals. These small RNAs form complex networks that regulate cell differentiation, development, and homeostasis. Deregulation of miRNA function is associated with many human diseases, including cancer. Thus, it has become important to understand the mechanisms that modulate miRNA activity, stability, and cellular localization through alternative processing and maturation, sequence editing, post-translational modifications of Argonaute proteins, viral factors, transport from the cytoplasm, and regulation of miRNA-target interactions. In addition, analysis of mutations in miRNA genes is key to understanding the disease in subjects and in cohorts.

Cellular mechanisms controlling the gene expression by microRNAs and alternative splicing have an effect on proteome diversity and have been implicated in complex diseases such as cancer and other disorders. Variations in the miRNA sequence and/or variations in the miRNA target region of a transcript can have a major impact on post-transcriptional regulation. Events of alternative splicing can occur in more than half of the human genes, thereby changing the sequence of key proteins related to drug resistance, activation, and metabolism. Furthermore, alternative splicing and miRNAs can work together to differentially control genes.

Embodiments as disclosed herein illustrate molecular aberrations of variants in the ncRNA genes and their correlations with diseases including cancers, non-cancer diseases, and multisystemic disorders. The mutations that disrupt the cellular functions which are dependent on non-coding RNA genes, or the factors required for the RNA functions, can be deleterious. The tRNA, rRNA, miRNA, siRNA, snoRNA, snRNA, and lncRNA genes are analyzed and the pathogenicity of mutations within them is established, which is of major diagnostic importance. Approaches are explored to modify the splicing pattern of a mutant ncRNA or replace an RNA gene that bears a disease-causing mutation to achieve therapy.

Embodiments as disclosed herein identify and illustrate SNPs and Indels at the miRNA-related functional regions such as 3′-UTRs and pre-miRNAs, which are key targets to uncover gene dysregulation resulting in susceptibility to or onset of human diseases. The deleterious mutations in the mitochondrial transfer RNA (mt-tRNA) and mitochondrially encoded rRNA (mt-rRNA) genes are known to cause many genetic diseases. Defects in oxidative phosphorylation in mitochondria are often associated with impairment of processes such as replication, transcription, or translation of mtDNA, which can be due to mutations in either of the mtDNA-encoded RNAs (tRNAs and rRNAs). mt-tRNA mutations can lead to several diseases including neurosensory non-syndromic hearing loss, diabetes mellitus, and a diverse range of clinical phenotypes. siRNA mutations also may be involved in disease. Discovering the disease-causing mutations in the ncRNA genes, and identifying the molecular mechanisms of disease, has potential benefit for both diagnostics and treatment of several diseases.

Moreover, embodiments as disclosed herein enable the identification of the various ncRNA genes in a genome and the details of these genes, including their promoters, exons, introns, and associated enhancer/silencer elements; the prediction of deleterious mutations and their molecular mechanisms; the illustration of these details in gene structure, tabular, and sequence views; and various interactive analysis capabilities. Furthermore, the analysis of variability in the ncRNA gene sequences plays an important role in deciphering disease associations.

Other elements in embodiments as disclosed herein may include:

Exon Splice: to predict whether potential exon skipping events that arise through alternative splicing would maintain or destroy the open reading frame of the gene.

Cryptic Splice: to find cryptic splice sites and cryptic exons in each gene based on user-defined score thresholds.

Exon Chart: to generate a graph of exon lengths within each gene, creating a visual chart of patterns such as outlying exons and length repetition.

Alternative Splice: to depict alternative splicing events such as exon skipping, intron retention, and alternative splice site usage in each of the predicted isoforms of a given gene.

Exon Frame: to create an exon-intron map for each gene. It locates the exons in three reading frames and displays the patterns of stop codons within introns.

Protein Signature: to highlight allowed and not-allowed AA substitutions at each position in protein domains, generating a unique AA signature for each domain.

UTR view: to illustrate the untranslated regions of mRNA sequences, including promoters, uORFs, dORFs, start and stop codon contexts, and poly-A signals.

Branch Points: to enable the study of branch points in genes, their involvement in splicing of exons and cryptic exons, and the consequences of mutations in them.

Enhancers and Silencers of gene regulation and splicing: A map to provide insights on the enhancers and silencers of regulatory and splicing elements in human genes, their association in gene regulation and splicing events, and the effects of their mutations in dysregulation of genes.

Non-coding RNA Genes: A map that facilitates visualizing the processes of splicing and mutations within the different non-coding (nc) RNA genes, and their implication in human diseases.

Dark Matter Genomics: A map that describes all of the coding, gene regulatory and splicing elements and their cryptic versions in the new genes identified within the introns of known genes and within the potentially undiscovered genes within the long intergenic regions.

Splice database: a database for the findings from each of the Splice Atlas maps, providing an integrated platform to analyze the regulatory and splicing elements of every gene from the genome in a single view.

The human genome includes ~3.2 billion bases. However, only 1-2% of the human genome codes for proteins. The coding sequences (exons) for proteins constitute a very small fraction of the gene itself, and the rest of the gene consists of introns and un-translated regions. The introns in numerous genes are extremely long, often longer than 100,000 bases and up to more than a million bases, and may contain unknown genes and regulatory sequences. In addition, there exist large regions of DNA sequence located between genes, defined as intergenic spaces. The function of most of these regions is currently unknown. However, these regions may contain sequences that regulate nearby genes, long non-coding RNAs, and genes that are yet undiscovered. Together, these non-coding genomic regions, which include the introns in the currently known genes and the intergenic regions between the currently known genes, constitute ~98% of the genome. It is thought that these regions, defined as the dark matter of the genome, may be very important to the functioning of the genome, and mutations in them may lead to numerous diseases.

Embodiments of the present disclosure define the dark matter genome as the regions within the genome that include the introns in the currently known genes and the intergenic regions between the currently known genes. Accordingly, a platform consistent with the present disclosure defines the white matter genome as the currently known and annotated genes, excluding the potential genes present within the introns. In some embodiments, a platform as disclosed herein identifies potential genes, protein-coding sequences, and the regulatory regions of these protein-coding genes, as well as the non-coding RNA genes, in the dark matter genome. Accordingly, some embodiments apply the functionalities of multiple modules therein to these newly discovered genes and obtain the various details for CDS and regulatory genetic elements, and their cryptic versions that occur within these genes. Some embodiments include modules that focus on the dark matter of the genome to unravel its hidden wealth and enable the discovery of important genetic information that will advance the understanding of disease and drug response, ultimately benefiting the practice of medicine. The platform aims to decipher these important regions within the dark matter of the genome and discover their involvement in disease by uncovering them in cohort studies from subjects with different diseases and adverse reactions to different drugs.

Accordingly, some embodiments work on a basic principle that deleterious, disease-causing mutations would be enriched in the gene(s) that cause the disease in a cohort of subjects, within any of the genetic elements including the CDS and the different regulatory elements in the gene, such as the promoter, UTR, splice donor, acceptor, and branch sites, enhancers and silencers, and poly-A sites, and their cryptic versions throughout the gene sequence. Thus, the platform approaches the discovery of the disease-causing genes by identifying the deleterious mutations in multiple different regulatory elements and their cryptic versions throughout the gene across the subject cohort. It also approaches this problem by identifying the deleterious mutations from selected elements within the intergenic regions, as the cryptic versions of these elements occur throughout the genes, including the other elements, UTR, exons, and introns, and the intergenic regions. Embodiments as disclosed herein use the Shapiro & Senapathy (S&S) algorithm and other relevant algorithms for detecting the splice sites and mutations in them, and develop unique scoring methods for the different regulatory elements by using the unique PWMs for the different elements based on their respective consensus sequences and the specific lengths of these elements. With this basic approach, Splice Atlas has discovered that it is able to identify deleterious mutations enriched within the different regulatory regions in addition to the coding regions. Furthermore, it has discovered that the deleterious disease-causing mutations are enriched in cryptic sites for the different regulatory elements that occur throughout the genes.
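As an illustration of PWM-based scoring of a candidate regulatory element, the following is a minimal sketch only. The matrix values, the 9-base donor window, the names `DONOR_PWM` and `pwm_score`, and the 0-100 normalization are assumptions chosen for the example; they are not the values or the exact scoring method of the disclosure or of the Shapiro & Senapathy algorithm.

```python
from typing import Dict, List

# Hypothetical toy PWM for a 9-base donor window (3 exonic + 6 intronic bases).
# Each entry is an illustrative percentage, not a published or disclosed value.
DONOR_PWM: List[Dict[str, float]] = [
    {"A": 35, "C": 35, "G": 20, "T": 10},
    {"A": 60, "C": 10, "G": 15, "T": 15},
    {"A": 10, "C": 5, "G": 80, "T": 5},
    {"A": 0, "C": 0, "G": 100, "T": 0},   # G of the GT dinucleotide
    {"A": 0, "C": 0, "G": 0, "T": 100},   # T of the GT dinucleotide
    {"A": 60, "C": 5, "G": 30, "T": 5},
    {"A": 70, "C": 10, "G": 10, "T": 10},
    {"A": 5, "C": 5, "G": 85, "T": 5},
    {"A": 15, "C": 15, "G": 20, "T": 50},
]


def pwm_score(site: str, pwm: List[Dict[str, float]]) -> float:
    """Score a candidate site on a 0-100 scale relative to the PWM's
    best and worst possible column sums (an assumed normalization)."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal the PWM length")
    total = sum(col[base] for col, base in zip(pwm, site.upper()))
    best = sum(max(col.values()) for col in pwm)
    worst = sum(min(col.values()) for col in pwm)
    return 100.0 * (total - worst) / (best - worst)
```

A different element (acceptor, branch site, poly-A signal) would use its own matrix and window length; only the matrix passed to `pwm_score` changes.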

The human genome is currently thought to consist of ~19K genes (19,127). However, it is likely that not all genes in the human genome have yet been deciphered, for several reasons. The gene finding programs rely on the knowledge of known proteins to determine if a gene should be considered valid. There are ~300 types of tissues in humans, and many of the proteins expressed in them occur at very low frequency and are yet unknown. Furthermore, there are a large number of genes that are activated at different places and times, and then switched off, during embryological development, many of which are also unknown. Thus, many proteins are yet to be uncovered from the human genome, which may occur within the long introns (>10,000 bases; 20,521 introns in the human genome) and within the intergenic regions (together, a total sequence length of 2.8 billion bases).

Dark Matter Genome

Human genome length: 3.2 billion bases
Number of genes: 19,127
Length of all genes: 1.2 billion bases
Total number of all exons: 200,603
Total length of all exons: 62.5 million bases
Total number of all introns: 181,458
Total length of all introns: 1.1 billion bases
Total number of introns >10,000 bases: 20,521
Total length of introns >10,000 bases: 728 million bases
Total length of all intergenic regions (= length of the genome − total length of all genes): 2.0 billion bases
Total length of all Dark Matter Genome (= length of all introns from current genes + length of all intergenic regions): 3.09 billion bases
Total length of introns >10,000 bases + total length of intergenic region: 2.8 billion bases

Data for this table have been obtained from our analysis of the human genome data from the NCBI (GRCh37.p13 assembly).

The estimates of the number of genes in the human genome vary considerably, from ~20,000 to 40,000. The current estimate from the National Human Genome Research Institute is 30,000. In addition, the number of genes in the human genome was thought to be 24,500 until 2007. In that year, by tweaking the maximum ORF length to be slightly shorter, the number of genes was reduced to ~20,000 based on the lack of their evolutionary conservation. It is also reasonable to expect that the current limit on the number of human genes reflects a desire to enable a practical set or catalogue of genes for research and medical applications, although many more genes could exist. Thus, there are strong reasons to expect that there could be many more genes yet to be discovered in the human genome.

Embodiments as disclosed herein include methods to identify and explore these undiscovered genes. A platform as disclosed herein uses multiple gene finding software programs (including Shapiro & Senapathy, Splice Atlas Splice Code, GenScan, Augustus, and GeneID) to find genes from the dark matter genome. In addition, a platform as disclosed herein uses the PfamScan database to uncover potential domains in these genes. These processes are expected to produce overlapping genes. However, they are advantageous to ensure that genes are not missed from the intergenic regions. Furthermore, a platform as disclosed herein could enable other platforms to use these newly discovered genes in individual subject, family, and cohort studies, wherein the occurrence of disease-relevant mutations in these regions can be determined. Embodiments as disclosed herein also enable the application of all of the platform's maps to the newly found genes from each of the gene finding programs, and create a database of selected data. This further enables the analysis of subject mutations from dark matter genes to identify the known and subject mutations that cause disease and drug response phenotypes.

Example System Architecture

FIG. 1 illustrates an architecture 100 of devices and systems for providing a map of genes and mutations thereof for an individual or a cohort of individuals, according to some embodiments. A server 130 may be coupled with a database 152 storing a genome sequence log for each of multiple users handling client devices 110. Servers 130, database 152, and client devices 110 may be communicatively coupled with each other via a network 150.

Servers 130 may interact and communicate with other devices in network 150 via any one of multiple interfaces and communications protocols (e.g., wired, cable, wireless, and the like). More specifically, servers 130 and client devices 110 may include an appropriate processor, memory, and communications capability, configured to interact with network 150 via a digital interface. Client devices 110 may include, for example, desktop computers, mobile computers, tablet computers (e.g., including e-book readers), a digital stand in a retailer store, mobile devices (e.g., a smartphone or PDA), wearable devices (e.g., smart watch and the like), or any other devices having appropriate processor, memory, and communications capabilities for accessing one or more of servers 130 through network 150. In some embodiments, client devices 110 may include a Bluetooth radio or any other radio-frequency (RF) device for wireless access to network 150. The memory in the client device from the retailer may include instructions from an application programming interface (API) hosted by server 130 (e.g., downloaded from, updated by, and in communication with server 130). The API in client devices 110 may be configured to cause client devices 110 to execute steps consistent with methods disclosed herein.

Network 150 can include, for example, any one or more of a local area network (LAN), a wide area network (WAN), the Internet, and the like. Further, network 150 can include, but is not limited to, any one or more of the following network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, tree or hierarchical network, and the like.

FIG. 2 is a block diagram 200 illustrating an example server 130 and client device 110 in architecture 100, according to certain aspects of the disclosure. Client device 110 and server 130 are communicatively coupled over network 150 via respective communications modules 218-1 and 218-2 (hereinafter, collectively referred to as “communications modules 218”).

Communications modules 218 are configured to interface with network 150 to send and receive information, such as data, requests, responses, and commands, to other devices on the network. Communications modules 218 can be, for example, modems or Ethernet cards. Client device 110 may be coupled with an input device 214 and with an output device 216. Input device 214 may include a keyboard, a mouse, a pointer, or even a touch-screen display that a user may use to interact with client device 110. Likewise, output device 216 may include a display and a speaker with which the user may retrieve results from client device 110. Client device 110 may also include a processor 212-1, configured to execute instructions stored in a memory 220-1, and to cause client device 110 to perform at least some of the steps in methods consistent with the present disclosure. Memory 220-1 may further include an application 222, including specific instructions which, when executed by processor 212-1, cause a graphic payload 225 hosted by server 130 to be displayed for the user in output device 216. Graphic payload 225 may include multiple graphic illustrations of a nucleotide string requested by the user from server 130. The user may store at least some of the illustrations and partial nucleotide strings from graphic payload 225 in memory 220-1.

In some embodiments, memory 220-1 may include an application 222, configured to display and process the contents in graphic payload 225. Application 222 may be installed in memory 220-1 by server 130, together with the installation of an operating system that controls all hardware operations of client device 110.

Server 130 includes a memory 220-2, a processor 212-2, and communications module 218-2. Processor 212-2 is configured to execute instructions, such as instructions physically coded into processor 212-2, instructions received from software in memory 220-2, or a combination of both. Memory 220-2 includes a genome sequence analysis engine 242. In some embodiments, genome sequence analysis engine 242 includes a sequence scoring tool 244, a mutation tool 246, a statistics tool 248, and an algorithm 250 to manipulate genome sequence data and create charts and reports for graphic payload 225.

Sequence scoring tool 244 parses at least a portion of a nucleotide string from a genome to identify a splicing site therein. More specifically, sequence scoring tool 244 identifies, in a nucleotide string, at least two exons, at least one acceptor, at least one donor, and at least one intron between the at least two exons. In some embodiments, sequence scoring tool 244 may identify, in a nucleotide string, at least two exons, at least one intron between the at least two exons, and a promoter sequence. In some embodiments, sequence scoring tool 244 may identify, in a nucleotide string, a poly-A addition site, wherein the poly-A addition site includes a poly-A site and a signal. In some embodiments, sequence scoring tool 244 may identify a first amino acid string corresponding to a functional protein or protein domain.

Mutation tool 246 may identify protein domains affected by mutations in the nucleotide string that may alter the splicing sites (according to sequence scoring tool 244). In some embodiments, mutation tool 246 may access a mutation log in database 252, to identify a recurring mutation over a cohort or a population of individuals. In some embodiments, mutation tool 246 may identify, in a nucleotide string, a positive signature when the nucleotide string codes an allowed amino acid in the functional protein, and a negative signature when the nucleotide string codes a non-allowed amino acid in the functional protein. In some embodiments, mutation tool 246 determines a deleterious effect of a mutation based on whether the mutation occurs within the positive signature or the negative signature in a protein domain. In some embodiments, mutation tool 246 identifies, in a nucleotide string coding a protein domain in the functional protein, a mutation leading to a disallowed amino acid. In some embodiments, mutation tool 246 determines a mutated hydropathy signature of the protein domain based on a hydropathy of a mutated amino acid. In some embodiments, mutation tool 246 determines a normal hydropathy signature of the protein domain based on a hydropathy of an allowed amino acid or a disallowed amino acid, and a deleteriousness score for the mutation based on a difference between the mutated hydropathy signature of the protein domain and the normal hydropathy signature of the protein domain. In some embodiments, mutation tool 246 also determines a deleteriousness score for the mutation based on whether a mutation occurs within a positive signature indicating no deleteriousness or a negative signature indicating a deleteriousness.
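The hydropathy-difference and signature checks described for mutation tool 246 can be sketched as follows. This is a hedged illustration: the use of the Kyte-Doolittle hydropathy scale, the summed absolute difference as the score, and the names `KD`, `hydropathy_signature`, `deleteriousness_score`, and `is_allowed` are assumptions for the example and are not specified by the disclosure.

```python
from typing import Dict, List, Set

# Kyte-Doolittle hydropathy index (assumed scale; the disclosure does not name one).
KD: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_signature(domain: str) -> List[float]:
    """Per-residue hydropathy values along a protein domain."""
    return [KD[aa] for aa in domain.upper()]


def deleteriousness_score(normal: str, mutated: str) -> float:
    """Sum of absolute per-position hydropathy differences between the
    normal and mutated domain (a hypothetical scoring choice)."""
    if len(normal) != len(mutated):
        raise ValueError("domains must have equal length (substitutions only)")
    pairs = zip(hydropathy_signature(normal), hydropathy_signature(mutated))
    return sum(abs(a - b) for a, b in pairs)


def is_allowed(position: int, residue: str, allowed: Dict[int, Set[str]]) -> bool:
    """True if the residue lies in the positive (allowed) signature at a position."""
    return residue.upper() in allowed.get(position, set())
```

For instance, replacing isoleucine with aspartate in a hydrophobic stretch yields a large score, while a conservative isoleucine-to-valine change yields a small one.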

Statistics tool 248 may perform a frequency analysis over the splice sites and the mutations identified by sequence scoring tool 244 and mutation tool 246. In some embodiments, statistics tool 248 may use mutation logs and gene sequencing logs in database 252 to evaluate statistical data on a nucleotide string for an individual or a cohort of individuals, for analysis. Algorithm 250 may be a linear or non-linear algorithm, including a neural network, machine learning, or artificial intelligence algorithm used to identify and score splicing sites (e.g., for sequence scoring tool 244). For example, in some embodiments, algorithm 250 may include the Shapiro & Senapathy algorithm to score a nucleotide string as a splice site (e.g., a ‘donor’ site or an ‘acceptor’ site), a MaxEntScan algorithm, and an NNSplice algorithm, among others. Algorithm 250 may combine various algorithms, including the updated version of the Shapiro & Senapathy algorithm, to derive the biological probability and impact of the various splicing events throughout the genome.
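A frequency analysis of mutations over a cohort, of the kind attributed to statistics tool 248, could look like the sketch below. The input layout (a mapping from subject identifier to a set of mutation labels), the function name `recurring_mutations`, and the `min_fraction` cutoff are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter
from typing import Dict, Set


def recurring_mutations(cohort: Dict[str, Set[str]],
                        min_fraction: float = 0.5) -> Dict[str, float]:
    """Return each mutation's cohort frequency (fraction of subjects carrying
    it), keeping only mutations at or above min_fraction."""
    n_subjects = len(cohort)
    if n_subjects == 0:
        return {}
    # Count each mutation once per subject.
    counts = Counter(m for mutations in cohort.values() for m in set(mutations))
    return {m: c / n_subjects for m, c in counts.items()
            if c / n_subjects >= min_fraction}
```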

In some embodiments, genome sequence analysis engine 242 enables a user to search a subject's genome based on the gene nomenclature, domain identifiers, clinical association, number of domains, and number of exons per domain based on the user's preferences. In some embodiments, genome sequence analysis engine 242 enables the user to search a subject's genome based on genes established and sourced from database 252 (e.g., a third party database such as NCBI). In some embodiments, genome sequence analysis engine 242 enables the user to search a subject's genome based on a protein domain identifier (e.g., using database 252 such as Pfam ID) according to a dropdown selection list in graphic payload 225. In some embodiments, genome sequence analysis engine 242 enables the user to search a subject's genome by a number of domains encoded by a gene. In genes with multiple domains, this search option is based on the genes with the highest number of domains. In some embodiments, genome sequence analysis engine 242 enables the user to search the subject's genome based on the number of exons within a gene. In genes with multiple domains, this search option is based on the domain that is encoded by the highest number of exons.

Database 252 blends data from NCBI, Ensembl, and UCSC to address the various needs of research and clinical genomics in the industry, handles complex queries, and supplies data seamlessly to the various Splice Atlas maps. In some embodiments, database 252 is robust, as it combines various external sources including NCBI's GRCh37.p13, Monarch Initiative, COSMIC (v91), ClinVar (clinvar_20200407), dbSNP (b153), and Pfam into one single database to handle various complex needs of splicing analysis across the genome. In some embodiments, database 252 integrates GenBank, EMBL Data Library, DDBJ, NBRF PIR, Protein Research Foundation, SWISS-PROT, and Brookhaven Protein Data Bank into a unified data set. In some embodiments, database 252 is scalable to accommodate data from various organisms and is maintained under optimized normalization mechanisms, so that it collects and disseminates the burgeoning amount of nucleotide and amino acid sequence data. In some embodiments, database 252 includes high-intensity data, which reveals dark matter genomics with appropriate supporting evidence. In some embodiments, database 252 includes multi-level annotations, including correlations between various variations and phenotypes, with supporting evidence. In some embodiments, database 252 includes an integrative database of the abundance of several different types of coding and non-coding sequences across the whole genome. Database 252 provides the data flexibility to generate the high-intensity data used to determine real splice sites, cryptic splice sites, and cryptic exons from genes of the human genome.

In some embodiments, genome sequence analysis engine 242 enables the user to search for genes having a high frequency of cryptic splice sites, based on the number of cryptic sites (e.g., ranging from 1-80,000). In some embodiments, genome sequence analysis engine 242 enables the user to search for genes having a high frequency of cryptic exons (e.g., ranging from 1-100,000) in a transcript. The cryptic exons can be visualized for individual transcripts for the selected gene. In some embodiments, genome sequence analysis engine 242 enables the user to search for genes with different ranges of cryptic splice site scores (for example, >50, >60, >70, >80, and >90 to choose from). The cryptic splice sites can be visualized for individual transcripts for the selected gene. In some embodiments, genome sequence analysis engine 242 enables the user to search for genes with different ranges of cryptic exon scores (for example, >50, >60, >70, >80, and >90 to choose from). The cryptic exons can be visualized for individual transcripts for the selected gene. In some embodiments, in genome sequence analysis engine 242, the user can choose the canonical transcript with the greatest number of exons for viewing and analysis. In some embodiments, genome sequence analysis engine 242 enables the user to visualize the cryptic splice sites and cryptic exons for genes with various exceptional features, including genes that contain an in-frame stop codon (TAA, TGA, TAG) inside the reading frame, that contain a selenocysteine-encoding stop codon (mostly TGA) in the coding sequence, or that contain no stop codon (TAA, TGA, TAG) at the end of the CDS. In some embodiments, genome sequence analysis engine 242 enables the user to search the subject's genome based on the canonical and non-canonical transcript identifiers, which may be listed in graphic payload 225 for the user's selection. In some embodiments, genome sequence analysis engine 242 enables the user to search the subject's genome based on a clinical association. Accordingly, a disease association of somatic cancer, germline cancer, inherited disorders, industrial panels, drug metabolizing gene (DMG) panels, and the American College of Medical Genetics and Genomics (ACMG) gene panels may be provided by genome sequence analysis engine 242 to the user in a dropdown list in graphic payload 225. Based on the selection criteria, genome sequence analysis engine 242 displays gene and transcript information for the user via graphic payload 225. For example, graphic payload 225 may include a gene name, a chromosome number, a gene ID, a strand, a protein ID, a protein length, and a number of exons. In addition, graphic payload 225 may include an information strip including a “Gene Info” button to display details on gene ontology and phenotype.
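The exceptional CDS features mentioned above (an in-frame stop codon, a possible selenocysteine TGA, or no terminal stop codon) can be flagged with a simple codon scan. The sketch below is illustrative; the function name `cds_exceptions` and the returned field names are hypothetical, and an internal TGA is only reported as a candidate, since sequence alone cannot distinguish selenocysteine recoding from a true stop.

```python
from typing import Dict, List, Union

STOP_CODONS = {"TAA", "TGA", "TAG"}


def cds_exceptions(cds: str) -> Dict[str, Union[List[int], bool]]:
    """Flag exceptional features of a coding sequence read in frame 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Stop codons anywhere before the final codon are "in-frame" internal stops.
    internal = [i for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]
    return {
        "in_frame_stop_codon_indices": internal,   # 0-based codon indices
        "has_terminal_stop": bool(codons) and codons[-1] in STOP_CODONS,
        "possible_selenocysteine": any(codons[i] == "TGA" for i in internal),
    }
```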

Server 130 may also include different modules which, in collaboration with the tools in genome sequence analysis engine 242, enable the different applications and aspects disclosed herein. For example, some of the modules include an exon splice module 260-1, a cryptic splice module 260-2, an exon chart module 260-3, an alternative splice module 260-4, an exon frame module 260-5, a protein signature module 260-6, an “un-translated” (UTR) view module 260-7, a branch point sequence (BPS) view module 260-8, a regulatory module 260-9, a non-coding (nc) RNA map module 260-10, and a dark matter module 260-11 (hereinafter, collectively referred to as “modules 260”). Exon splice module 260-1 identifies exons in a nucleotide string, and provides data analysis regarding the proteins and protein domains codified by the exons, and the possible protein isoforms or deleterious effects produced by amino acid rearrangements and other effects or mutations.

Exon splice module 260-1 indicates exon splices and provides a prediction of exon splice consequences, and may include a visualization tool based on genes established and sourced from external databases, libraries, and sources. Exon splice module 260-1 may also include a search engine with a protein-wise nomenclature based on selected identifiers, in a drop-down selection list. Exon splice module 260-1 enables the search for genes based on various search criteria as mentioned above. Based on the selection criteria, the gene and transcript information are displayed. Information such as gene name, chromosome number, gene ID, strand, protein ID, protein length, and number of exons is displayed, along with details on gene ontology and phenotype, on clicking the “Gene Info” button available in an information strip. In some embodiments, exon splice module 260-1 is divided into three different sections: after splicing view, sequence view, and hydropathy. The consequences of exon splicing in the gene are depicted under the after splicing view tab, including AA maintained, AA changed, AA change+PTC, frameshift, frameshift+PTC, overlapping domains, domain disruption, and domain skipping. The complete coding sequence of the selected transcript, before and after splicing, is shown in the sequence view tab. The hydropathy plot depicts the hydropathy index values determined by various methods along the amino acid sequence of the selected transcript.
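A simplified version of the exon-skipping consequence prediction can be sketched as follows. It assumes the coding portions of the exons are supplied as strings in transcript order, and it reduces the categories above to four labels; the function names `first_stop_offset` and `skipping_consequence`, and the rule used to call a premature termination codon (PTC), are assumptions for illustration.

```python
from typing import List, Optional

STOPS = {"TAA", "TAG", "TGA"}


def first_stop_offset(cds: str) -> Optional[int]:
    """Nucleotide offset of the first in-frame stop codon, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOPS:
            return i
    return None


def skipping_consequence(coding_exons: List[str], skipped: int) -> str:
    """Classify the effect of skipping one exon on the reading frame.

    A stop codon is called premature (PTC) if it occurs before the last
    three bases of the spliced coding sequence (assumed rule)."""
    exons = [e.upper() for e in coding_exons]
    spliced = "".join(e for i, e in enumerate(exons) if i != skipped)
    frameshift = len(exons[skipped]) % 3 != 0
    stop = first_stop_offset(spliced)
    ptc = stop is not None and stop < len(spliced) - 3
    label = "frameshift" if frameshift else "in-frame"
    return label + "+PTC" if ptc else label
```

Skipping an exon whose length is a multiple of three keeps the downstream frame; any other length shifts it, which frequently exposes a premature stop.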

A list of genes from an external database, library, or resource (NCBI, ENSEMBL, and the like) may be downloaded and integrated into Splice database 252 to provide the list of genes, exons, coding sequence, 5′ and 3′ UTRs, poly-A signal sequences, promoter sequences, and clinical association of genes with diseases (as sourced from dbSNP, COSMIC, and ClinVar). The exons are classified based on their coding features into 5′ and 3′ noncoding sequences, 5′ and 3′ partially coding sequences, fully coding sequences, upstream open reading frames (uORFs), downstream open reading frames (dORFs), poly-adenylated tails, Kozak sequence contents, and various promoter boxes (TATA, GC, CAAT, and initiator), each of which is computed, identified, and tagged.
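The classification of exons by coding features can be illustrated with a coordinate comparison against the CDS boundaries. This sketch assumes transcript-oriented, 1-based inclusive coordinates; the function name `classify_exon` and the label strings are hypothetical, and uORF, dORF, and promoter-box tagging are outside its scope.

```python
def classify_exon(exon_start: int, exon_end: int,
                  cds_start: int, cds_end: int) -> str:
    """Classify an exon relative to the CDS (transcript-oriented coordinates,
    1-based inclusive; an assumption of this sketch)."""
    if exon_end < cds_start:
        return "5' noncoding"
    if exon_start > cds_end:
        return "3' noncoding"
    if exon_start < cds_start and exon_end > cds_end:
        # A single exon containing the whole CDS plus both UTR portions.
        return "5' and 3' partially coding"
    if exon_start < cds_start:
        return "5' partially coding"
    if exon_end > cds_end:
        return "3' partially coding"
    return "fully coding"
```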

Cryptic splice module 260-2 uses algorithm 250 (e.g., the Shapiro & Senapathy algorithm) to identify cryptic splice sites and cryptic exons in human genes. Cryptic splice module 260-2 is a beneficial tool that helps investigate splicing mutations in disease, as Cryptic Splice Sites (CSSs) and cryptic exons are known to be involved in numerous diseases. More generally, cryptic versions of every regulatory element occur within a gene sequence. Furthermore, cryptic exons also occur throughout the gene sequence. Cryptic splice module 260-2 identifies one or more of these elements throughout the gene sequence, and displays them in graphical, tabular, and sequence views. Cryptic splice module 260-2 also determines the mutations that occur within these elements, and displays the details in various forms of illustrations from a subject's sequence data and from various public data sources including dbSNP, ClinVar, and COSMIC. Cryptic splice module 260-2 also identifies the cryptic versions of other regulatory elements throughout the gene sequence, and the mutations in them, and provides detailed illustrations in various forms.
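Scanning a gene sequence for cryptic donor sites above a user-chosen threshold might be sketched as below. The scoring function here is a deliberately simple stand-in (percentage of positions matching a commonly cited donor consensus), not the Shapiro & Senapathy score; `CONSENSUS`, `match_score`, `find_cryptic_donors`, the 9-base window, and the default threshold are all assumptions for the example.

```python
from typing import Iterable, List, Tuple

CONSENSUS = "CAGGTAAGT"  # commonly cited donor consensus; simplistic stand-in


def match_score(window: str) -> float:
    """Percentage of window positions matching the consensus (toy score)."""
    matches = sum(a == b for a, b in zip(window, CONSENSUS))
    return 100.0 * matches / len(CONSENSUS)


def find_cryptic_donors(seq: str, real_donor_positions: Iterable[int],
                        threshold: float = 70.0) -> List[Tuple[int, str, float]]:
    """Return (GT position, window, score) for GT sites that are not annotated
    real donors and whose score exceeds the threshold. Positions are 0-based
    indices of the G in the GT dinucleotide."""
    seq = seq.upper()
    real = set(real_donor_positions)
    hits = []
    # Window = 3 bases upstream of GT, the GT itself, and 4 bases downstream.
    for gt in range(3, len(seq) - 5):
        if seq[gt:gt + 2] != "GT" or gt in real:
            continue
        window = seq[gt - 3:gt + 6]
        score = match_score(window)
        if score > threshold:
            hits.append((gt, window, round(score, 1)))
    return hits
```

Raising the threshold (for example from >70 to >90) narrows the output to the candidates most similar to a real site, mirroring the selectable score ranges described for the platform.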

Exon chart module 260-3 enables visual classification and analysis of exon lengths and their accompanying splicing features, including unusual exon patterns in distinct genes. In some embodiments, exon chart module 260-3 applies algorithm 250 (e.g., the Shapiro & Senapathy algorithm and other relevant algorithms) to determine the scores of real and cryptic splice sites in the outlier exons and other exons in a gene. In some embodiments, exon chart module 260-3 enables the analysis of outlying exons that have highly outlying lengths compared to the other exons in the gene, and their real splice sites, cryptic splice sites, real exons, cryptic exons, branch point sites, enhancers and silencers, and their scores. In some embodiments, exon chart module 260-3 displays regulatory elements and their cryptic versions within the outlying exon in graphical, tabular, and sequence views. In some embodiments, exon chart module 260-3 enables the graphical depiction of exons with repeated lengths and outlying exons in a gene, and their correlations with the splice donor, acceptor, and exon scores, and their DNA and protein sequences, using dropdowns for user selection of these features and their involvement in disease. In some embodiments, exon chart module 260-3 enables various searching options using nested search boxes for the user to choose genes by gene length, CDS length, exon length repetition, exons with outlying lengths, disease associated with such genes, and exceptional genes with these features. In some embodiments, exon chart module 260-3 enables the search option for genes from various gene panels such as disease panels, drug metabolizing gene (DMG) panels, the American College of Medical Genetics and Genomics (ACMG) gene panels, and other user-given gene panels, and enables the visualization and analysis of any gene provided. In some embodiments, exon chart module 260-3 provides the capability to analyze different exon classes based on length, length of the preceding and following exons and introns, and the scores of the acceptor and donor splice sites.
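Detecting exon length repetition and outlying exons in a gene can be sketched as follows. The outlier rule used here (a length greater than a multiple of the median exon length) and the names `exon_length_patterns` and `outlier_factor` are assumptions for illustration; the disclosure does not specify how an outlying length is defined.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Tuple


def exon_length_patterns(lengths: List[int], outlier_factor: float = 5.0
                         ) -> Tuple[Dict[int, int], List[Tuple[int, int]]]:
    """Return (repeated lengths with their counts, outlying exons).

    Outlying exons are reported as (exon number, length), with exon numbers
    starting at 1, when length > outlier_factor * median length."""
    counts = Counter(lengths)
    repeated = {length: n for length, n in counts.items() if n > 1}
    med = median(lengths)
    outliers = [(i, length) for i, length in enumerate(lengths, start=1)
                if length > outlier_factor * med]
    return repeated, outliers
```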

In some embodiments, exon chart module 260-3 provides the capability toanalyze different sets of exons, each set with the same lengths, andtheir splice scores, exon sequences, amino acid sequences, and theability to analyze various parameters such as if the sequences of exonsof the same length are similar or different, and determining if thesplice site sequences and scores are similar or different. In someembodiments, exon chart module 260-3 depicts the real and cryptic splicesites by employing Shapiro & Senapathy and other relevant algorithms andcomparing the scores for exons with repeat lengths in genes from anygiven organism, including the human, in an automated manner. In someembodiments, exon chart module 260-3 enables the automated analyses ofthe many features of an exon chart and providing the tabular, graphical,and sequence representation for the analysis of every gene from anyorganism including animals, plants, and microorganisms. In someembodiments, exon chart module 260-3 classifies and analyzes exons basedon their coding features into 5′ non-coding sequences, 3′ non-codingsequences, 5′ partially-coding sequences, 3′ partially-coding sequences,and fully coding sequences for the genes with repeated exon lengths andoutlying exons. In some embodiments, exon chart module 260-3characterizes the various exons present in a gene into multiplecategories based on their length to identify the exon length repetition,highest exon lengths to signify the “outliers” in a gene, and theexception codons which contain no stop codon, in-frame stop codon, orselenocysteine codon sequences. In some embodiments, exon chart module260-3 creates a repository containing information for genes in a genomesuch as exon details with the exon length, genomic position of theexons, transcript details, real/cryptic splice donors and acceptors,splicing scores, and enabling the display and analysis of any gene by aquery. 
In some embodiments, exon chart module 260-3 enables a search for genes that fit various parameters of exon lengths, gene lengths, outlier exon lengths, exons with the same lengths, non-coding, partially coding, and fully coding exon lengths, genes from different gene panels, and genes from different diseases, determines whether any disease correlates with such genes or vice versa, and provides the ability to analyze these genes in graphical, tabular, and sequence illustrations. In some embodiments, exon chart module 260-3 overlays the subject(s)' mutations on the gene with depictions in an exon chart, in graphical gene structure and sequence illustrations in color codes for depicting the features of exons, promoter boxes, 5′ and 3′ UTRs, real/cryptic splice sequences, poly-A site and region, and branch point regions, and provides the ability to analyze them for different parameters of exons provided by an exon chart, including the correlation of the subject mutations with gene features. In some embodiments, exon chart module 260-3 enables analysis of enhancers and silencers in the outlying exons, especially the first and last exons, to determine if the long lengths are required in order to accommodate these regulatory sequences or signals. In some embodiments, exon chart module 260-3 indicates the consequences of a mutation in graphical and sequence illustrations, and plots subject mutations in real or cryptic splice sites and exonic regions, together with the known mutations from different databases such as dbSNP, ClinVar, and COSMIC, categorized into clinical significance, molecular consequence, variation type, and pathogenicity based on the SIFT and/or PolyPhen scores, on any gene chosen by the user. In some embodiments, exon chart module 260-3 enables the query and analysis of different parameters of genes in an exon chart for the detection and analysis of unusual length repetition patterns and splicing patterns in distinct genes, and possible disease connections.
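
The grouping of exons by length described above can be illustrated with a minimal sketch. It is not the exon chart algorithm itself: the function name `exon_length_classes` and the `outlier_factor` cutoff (an exon is flagged when it exceeds a multiple of the median exon length) are assumptions introduced here only for illustration, since the text does not state how outliers are defined.

```python
from collections import Counter
from statistics import median


def exon_length_classes(exon_lengths, outlier_factor=3.0):
    """Group exons by length and flag unusually long ones.

    outlier_factor is a hypothetical cutoff (not from the source text):
    an exon counts as an outlier when it is longer than
    outlier_factor times the median exon length of the gene.
    """
    if not exon_lengths:
        return {}, []
    counts = Counter(exon_lengths)
    # Lengths shared by more than one exon ("exon length repetition").
    repeated = {length: n for length, n in counts.items() if n > 1}
    cutoff = outlier_factor * median(exon_lengths)
    outliers = [length for length in exon_lengths if length > cutoff]
    return repeated, outliers


# Toy gene with seven exons (lengths in nucleotides, made up for the example).
lengths = [120, 96, 120, 150, 96, 120, 2400]
repeated, outliers = exon_length_classes(lengths)
print(repeated)   # lengths that repeat, with their counts
print(outliers)   # exons far longer than the median
```

In this toy gene, two lengths repeat and the 2400-nucleotide exon is the only one flagged as an outlier. A real implementation would presumably also carry exon coordinates, coding class, and splice scores alongside each length.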

In some embodiments, exon chart module 260-3 provides various information about the exons in a gene and its associated elements, such as protein family and domains, ontology information, and disease phenotypes, using i-icons, mouse hovers, and context-sensitive popups. In some embodiments, exon chart module 260-3 applies an exon chart algorithm and displays real and cryptic splicing elements, exons, introns, and abnormalities identified in the ncRNA genes (tRNA, rRNA, miRNA, snoRNA, snRNA, and lncRNA), and further drills down into the gene structure and sequence views of elements displayed on the gene sequence, depicting them in different color codes. In some embodiments, exon chart module 260-3 enables the user guide of the platform, such as the “About” that provides context-sensitive explanations for various features and applications, and the “How To” that provides context-sensitive information on how to use particular features throughout the different sections of the platform. In some embodiments, exon chart module 260-3 provides statistics for genes in a given organism and displays the information based on different length ranges, comparing the distribution of genes having repetitive lengths and genes having outlier exons, and depicting these statistics in various bar and pie charts. In some embodiments, exon chart module 260-3 enables the use of tightly coupled navigation by interlinking different sections to provide analysis of a gene, protein, or other elements and features throughout the platform.

Alternative splice module 260-4 uses algorithm 250 (e.g., the Shapiro & Senapathy algorithm and other relevant algorithms) to identify alternative splicing events such as exon skipping, intron retention, and alternative splice site usage in each of the predicted isoforms of the given gene. In some embodiments, alternative splice module 260-4 provides a catalog of predicted alternative transcripts in human genes, including those that may or may not genuinely encode distinct proteins. Alternative splice module 260-4 identifies unique splicing events in the alternative transcripts when compared with a canonical transcript, such as exon skipping, exon inclusion, intron retention, and alternative splice site usage.
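
As a rough illustration of comparing an alternative transcript with a canonical transcript, the sketch below labels events purely from exon coordinates. It is a simplification under stated assumptions, not the module's method: exons are hypothetical `(start, end)` pairs with inclusive coordinates on the plus strand, so an exon's start follows an acceptor site and its end precedes a donor site; first and last exon boundaries, which reflect transcription start and polyadenylation rather than splicing, are not treated specially; and no splice site scoring is performed.

```python
def overlaps(a, b):
    """True when two inclusive (start, end) intervals share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_events(canonical, transcript):
    """Label splicing differences of `transcript` relative to `canonical`.

    Both arguments are lists of (start, end) exons on the plus strand.
    Returns a list of (event_name, exon) tuples.
    """
    events = []
    for t_exon in transcript:
        hits = [c for c in canonical if overlaps(c, t_exon)]
        if not hits:
            # Exon absent from the canonical transcript.
            events.append(("exon inclusion", t_exon))
        elif len(hits) > 1:
            # One exon spanning several canonical exons keeps the intron(s).
            events.append(("intron retention", t_exon))
        else:
            c = hits[0]
            start_differs = t_exon[0] != c[0]   # acceptor side
            end_differs = t_exon[1] != c[1]     # donor side
            if start_differs and end_differs:
                events.append(("alternative acceptor and donor", t_exon))
            elif start_differs:
                events.append(("alternative acceptor", t_exon))
            elif end_differs:
                events.append(("alternative donor", t_exon))
    for c in canonical:
        if not any(overlaps(c, t) for t in transcript):
            events.append(("exon skipping", c))
    return events


# Made-up coordinates for illustration only.
canonical = [(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]
isoform = [(100, 200), (320, 400), (700, 1000)]
events = classify_events(canonical, isoform)
for name, exon in events:
    print(name, exon)
```

For a minus-strand gene the acceptor and donor sides would be swapped, which this sketch does not handle.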

Alternative splice module 260-4 maps alternative splice events and their molecular effects in different transcripts of a gene compared with the canonical transcript, which is defined by various methods. In addition, it maps these details based on constitutive exons defined by various methods. In alternative splice module 260-4, differences among transcripts are also correlated with changes in the encoded structural domains, thereby capturing the functional regions of proteins that alternative splicing may normally or deleteriously affect. Alternative splice module 260-4 thus simplifies the prediction of the particular transcripts resulting in distinct proteins and distinguishes them from the artifacts of mistaken sequence annotation, which is key to the advancement of the field of clinical genomics and Precision Medicine. In some embodiments, alternative splice module 260-4 enables the visualization of known mutations and of mutations from individual subjects and cohorts of subjects. In addition to the mutational analysis, alternative splice module 260-4 also provides analysis of the domains encoded by different isoforms of a gene in a single view. Thus, alternative splice module 260-4 provides insight into aspects of alternative splicing in genes, their impacts on functional domains, and mutational analysis.

Alternative splice module 260-4 provides multiple ways to view and analyze alternative splicing events, such as based on gene, where the alternative splicing events can be visualized for individual transcripts of the selected gene, and based on clinical association, where the alternative splicing events can be visualized for individual transcripts of the genes implicated in the panels for all major cancers and inherited disorders. In some embodiments, alternative splice module 260-4 provides alternative splicing events, wherein the user can select a particular transcript of a given gene and explore different alternative splicing events together, including skipped exons, cryptic exons, exons with alternative acceptor splice sites, exons with alternative donor splice sites, exons with alternative acceptor and donor splice sites, and retained introns. In some embodiments, alternative splice module 260-4 identifies genes based on a number of transcripts (and selects the highest, or one of the highest); genes having a high number of transcripts can be searched (e.g., ranging from 1-28), and the alternative splicing events can be visualized for individual transcripts of these selected genes.

Alternative splice module 260-4 displays unique splicing events resulting in alternative transcripts when compared with the canonical transcript, such as exon skipping, exon addition, intron retention, and alternative splice site usage from any gene in a genome. In some embodiments, alternative splice module 260-4 correlates the differences among transcripts with changes in the encoded structural domains, thereby capturing the functional regions of proteins that alternative splicing may deleteriously affect, simplifies the prediction of transcripts by ascertaining whether the resulting amino acid sequences are distinct proteins or the artifacts of mistaken sequence annotation, and performs deeper pattern analysis of variations in alternative transcripts and alternatively spliced sites among different transcripts of a given gene. In some embodiments, alternative splice module 260-4 provides a “View All” option that layers the information from the canonical transcripts, current transcript, and the splice events across various methods such as “Canonical based” and “Exon based,” in one unified view. In some embodiments, alternative splice module 260-4 enables a user-driven approach to identify and correlate the clinical association of mutations in various alternative splicing events, including constitutive, cryptic, altered donor, altered acceptor, altered acceptor+donor, skipped exon, and intron retention, in any cancers and non-cancer disorders. 
In some embodiments, alternative splice module 260-4 provides a pop-up window for explaining the alternative splicing events based on the occurrences of alternatively spliced exons in different transcripts defined by various methods including canonical and constitutive, displays the layers of events such as coding exons, pre-spliced domains, and splice events in the canonical transcript and transcript isoforms, and plots the mutations from the subject and known mutations from various mutation-disease databases, highlighting them in gene structure, tabular, and sequence views and providing deeper analytical capabilities. In some embodiments, alternative splice module 260-4 displays possible alternative splicing events of the transcript in multiple views, based on possible combinations of canonical transcript and constitutive exons, including partially-coding and non-coding exons and un-translated regions. In some embodiments, alternative splice module 260-4 illustrates various forms of splicing events in a transcript having the longest CDS, longest mRNA, highest number of mRNA exons, and highest number of coding exons, and depicts various forms of transcripts based on protein-coding and non-coding exons in the genes from the genome of any organism in an automated manner; and identifies mutations in the alternative splicing sites within the introns, exons, or any part of a gene from a subject, computes the scores for them using Shapiro & Senapathy and other relevant algorithms, and determines the pathogenicity of these mutations by comparing the scores with those of the splicing sites of the normal sequence.

In some embodiments, alternative splice module 260-4 displays the locations of alternative splicing events and alternatively spliced exons in a transcript and overlays the subject(s)' mutations and known mutations from various public databases in graphical, tabular, and sequence views with pop-up boxes, mouse hovers, and context-sensitive explanations. In some embodiments, alternative splice module 260-4 enables nested search boxes for the user to choose the genes based on number of transcripts, alternative splicing events (e.g., skipped exons, cryptic exons, exons with alternative acceptor splice sites, exons with alternative donor splice sites, exons with alternative acceptor and donor splice sites, and retained introns), disease associated genes, and exceptional genes; and predicts the splice events in genes from a portion of the genome or the whole genome without manual intervention, enabling the automated analysis of every data point. In some embodiments, alternative splice module 260-4 provides information about the gene and its associated elements, such as protein family and domains, ontology information, and disease phenotypes, using i-icons, mouse hovers, and context-sensitive popups while depicting alternatively spliced events and transcripts; and compares the domains within the selected transcript with those in the canonical transcript, highlighting in different colors the portions of the canonical domains that have been removed, and the portions that have been added, in the selected transcript.

In some embodiments, alternative splice module 260-4 displays the exon-intron structure of a selected gene with alternative splicing events in color-coded visuals and provides automated display of graphical and sequence illustrations in an expanded view; enables the search option for genes from various gene panels such as disease panels, drug metabolizing gene (DMG) panels, the American College of Medical Genetics and Genomics (ACMG) gene panels, and other user given gene panels; and displays splice event details in an expanded view on clicking any of the splice events on the graphical illustration. In some embodiments, alternative splice module 260-4 extends alternative splice prediction, analysis, and illustration to non-coding RNA genes (e.g., tRNA, rRNA, miRNA, snoRNA, siRNA); and provides statistical analysis for genes in a genome, displaying the information on coding and non-coding sequences, the distribution of alternative exon classes for each gene, and the frequencies of coding and non-coding transcripts per gene, in tabular and graphical illustrations.

In some embodiments, alternative splice module 260-4 represents the consequences of a mutation in the alternatively spliced structures of the transcripts and the gene with graphical and sequence illustrations, and plots subject mutations in real or cryptic splice sites and exonic regions, together with the known mutations from different databases such as dbSNP, ClinVar, and COSMIC, categorized into clinical significance, molecular consequence, variation type, and pathogenicity based on the SIFT and/or PolyPhen scores, in graphical, tabular, and sequence illustrations. In some embodiments, alternative splice module 260-4 is applicable to any organism including animals, plants, and microorganisms. In some embodiments, alternative splice module 260-4 analyzes subject mutations overlaid on the genes to correlate their involvement in disease; analyzes mutations from databases such as ClinVar, dbSNP, and COSMIC on the genes to correlate their involvement in disease; and enables the user guide of the platform, such as the “About” that automatically provides context-sensitive explanations for various features and applications, and the “How To” that automatically provides context-sensitive information on how to use particular features throughout the different sections of the platform. In some embodiments, alternative splice module 260-4 provides a map for genes in various organisms and displays various statistics and information across the genome; enables the use of tightly coupled navigation by interlinking different sections to provide analysis of a gene, protein, or other elements and features throughout the platform; and enables an expanded version for each of the features of the AltSplice map, which allows users to visualize and analyze further details in graphical, tabular, and sequence illustrations.

Exon frame module 260-5 determines the possible distribution of stop codons and coding exons in a reading frame before and after splicing events. A reading frame is a way of dividing the sequence of nucleotides into a set of consecutive, non-overlapping triplets, called codons, which equate to amino acids or stop signals during translation. In some embodiments, exon frame module 260-5 analyzes and verifies, while mapping the different stop codons, that a stretch of the nucleotide string between two stop codons does not fall inside an exon region. To verify this, the length of each of the exons and the open reading frame are plotted separately. The longest exon in any transcript should be shorter than the maximum distance between two stop codons in all the reading frames. After splicing, the CDS length should be shorter than the maximum distance between two stop codons.
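
The reading-frame check described above can be sketched as follows. This is an illustrative reading, not the module's reading frame algorithm: it assumes the standard stop codons (TAA, TAG, TGA) on the forward strand only, measures the "distance between two stop codons" as the stop-free stretch in a frame (including stretches bounded by the ends of the sequence), and uses hypothetical function names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def stop_positions(seq, frame):
    """0-based start positions of stop codons in the given frame (0, 1, or 2)."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]


def longest_stop_free_stretch(seq, frame):
    """Length in nucleotides of the longest run of non-stop codons in a frame."""
    seq = seq.upper()
    longest = current = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0
        else:
            current += 3
            longest = max(longest, current)
    return longest


def exons_fit(seq, exon_lengths):
    """True when the longest exon is shorter than the longest stop-free
    stretch found in any of the three forward reading frames."""
    best = max(longest_stop_free_stretch(seq, frame) for frame in range(3))
    return max(exon_lengths) < best


seq = "ATGAAATAGCCCGGGTTTTGA"  # toy sequence, 21 nucleotides
for frame in range(3):
    print(frame, stop_positions(seq, frame), longest_stop_free_stretch(seq, frame))
print(exons_fit(seq, [9, 6]))
```

A fuller version would track which frame each exon actually occupies and repeat the comparison on the spliced CDS, as the paragraph describes for the "after splicing" case.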

In some embodiments, exon frame module 260-5 allows the determination, analysis, and illustration of the exon-intron structures across ORF patterns of a gene and determines the structure of a gene with respective reading frames that contain exons of a gene and the patterns before and after splicing by constructing an image of the entire split gene, including the exons, introns, splice junction signals, and stop codons that occur within each frame. In some embodiments, exon frame module 260-5 streamlines the detection of atypical gene patterns, such as long exons, long open reading frames without annotated exons, or short introns, and illustrates exons and ORFs in a single reading frame of the gene along with their splice sites and scores calculated using algorithm 250 (e.g., the Shapiro & Senapathy algorithm and other relevant algorithms). In some embodiments, exon frame module 260-5 represents three reading frames of a transcript, along with all possible stop codons in each reading frame, and plots the coding exons in the appropriate reading frames by using the reading frame algorithm.

In some embodiments, exon frame module 260-5 displays exons, splice sites, branch sites, stop codons, and all possible splicing signals within UTRs, partially-coding and non-coding exons, and un-translated regions in “Before” and “After” splicing visual illustrations, and provides capabilities for studying the distribution pattern of ORFs and exons in a gene by comparing their distribution and frequencies in the gene sequence and randomly generated sequences in graphical, tabular, and sequence views. In some embodiments, exon frame module 260-5 identifies mutations in a single reading frame from one or more subjects within the splice sites, exons, or any part of a gene, computes the scores for them using algorithm 250 (e.g., the Shapiro & Senapathy algorithm and other relevant algorithms), and determines the pathogenicity of these mutations by comparing the differences in the scores with the normal sequence. In some embodiments, exon frame module 260-5 illustrates mutations from the subject's genome and known mutations from various public gene-disease databases in graphical, tabular, and sequence views with pop-up boxes, mouse hovers, and context-sensitive explanations, and displays a scatter plot showing the distribution of the lengths of ORFs, exons, spliced exons, mRNA, and CDS in the gene and randomly generated sequences, enabling the comparison of these features between the gene and random sequences.

In some embodiments, exon frame module 260-5 enables nested search boxes for the user to choose the genes and transcripts based on exon and ORF length range, gene length, CDS length, disease associated genes, and exceptional genes, provides information about the gene and its associated elements, such as protein family and domains, ontology information, and disease phenotypes, using i-icons, mouse hovers, and context-sensitive popups, and enables search options for genes from various gene panels such as disease panels, drug metabolizing gene (DMG) panels, the American College of Medical Genetics and Genomics (ACMG) gene panels, and other user given gene panels. In some embodiments, exon frame module 260-5 displays a visual graphic of the structure of genes with exons, promoters, poly-A sites, stop codons, branch points, splice enhancers and silencers, and splice sites in a compact and expanded view along with relevant details showing the length of each exon, ORFs, and spliced exons in graphical, tabular, and sequence illustrations. In some embodiments, exon frame module 260-5 predicts the exon frame features of a gene, plots the subject's mutations or any known mutations in the coding and non-coding regions, compares the mutations from different gene-disease databases such as dbSNP, ClinVar, and COSMIC, determines the connection to various diseases, visualizes their clinical impacts on gene structure and sequence illustrations, and represents the consequences of mutations that are categorized into clinical significance, molecular consequence, variation type, and pathogenicity based on the SIFT and/or PolyPhen scores.

In some embodiments, exon frame module 260-5 applies single and three reading frame predictions, displays the exons, introns, and abnormalities identified in the ncRNA genes (tRNA, rRNA, miRNA, snoRNA, snRNA, and lncRNA), and further performs sequence analysis and visualization of elements displayed on the gene sequence, depicting them in different color codes. In some embodiments, exon frame module 260-5 enables the user guide of the platform, such as the “About” that automatically provides context-sensitive explanations for various features and applications, and the “How To” that automatically provides context-sensitive information on how to use particular features throughout the different sections of the platform. In some embodiments, exon frame module 260-5 determines and provides the ExonFrame statistics analyzed for all the genes in the human genome and various organisms, and displays information such as the average length of exons, introns, and ORFs, occurrences of stop codons in splice sites, codon distribution in splice sites, and the distribution of coding exon length, intron length, and ORF length across the genome and in randomly generated sequences, in tabular and graphical illustrations. In some embodiments, exon frame module 260-5 enables the use of tightly coupled navigation by interlinking different sections to provide analysis of a gene, protein, or other elements and features throughout the platform, illustrates patterns in expanded views, shows more details in sequence views, and is applicable to various organisms including animals, plants, and microbes.

Protein signature module 260-6 enables the analysis of selected protein features in a genome, and their aberrations due to mutations that lead to diseases and other afflictions such as adverse drug reactions. It further enables the visualization and analysis of various details including the exon-domain signatures, cryptic splice sites, and the protein signature showing variable amino acids at each position of the domains, which provides a deeper understanding of the allowed and non-allowed amino acids of the domains. When a gene is chosen in protein signature module 260-6, coding exons of the selected gene and transcript are displayed with their corresponding domains overlaid as colored lines. Mutations on these coding exons can be visualized by selecting the mutation toggle option. On clicking the domains above their coding exons, domain details and various types of signatures, such as 20 colors, Positive-Negative, Hydro, Cryptic splice, Alternative splicing, and Whole protein signature, are displayed for further analysis. In some embodiments, protein signature module 260-6 performs or collects alignment results from a third party database (e.g., database 252, including the Pfam database), including a seed alignment and a full alignment. In some embodiments, a seed alignment includes a set of manually curated amino acids from the domain sequences from several genomes and thus tends to have a smaller number of amino acids than the full alignment. In some embodiments, a full alignment includes a set of amino acids produced from several genomes that are aligned using Hidden Markov models, and the like.

In some embodiments, protein signature module 260-6 determines, analyzes, and illustrates the protein sequence signatures of a protein and its domains, their associated features such as the hydropathy and splicing, and the clinical and biological impacts of genetic mutations. In some embodiments, protein signature module 260-6 provides a protein chart to determine and illustrate the analysis of variable amino acids in protein-coding sequences under three different tabs: Protein Overview, Cryptic Splice Sites, and Variant Density. In some embodiments, protein signature module 260-6 converts the amino acid alignments from the Pfam database into amino acid signatures of proteins and their domains, by identifying the variable amino acids and avoiding the redundant amino acids at each position, and by determining if an amino acid occurs at greater than a specific fraction (e.g., 50%) of the aligned positions, thus incorporating a unique algorithm. In some embodiments, protein signature module 260-6 defines an algorithm that identifies the different non-redundant amino acids at each position and includes them as the variable or allowed amino acids at that position, taking into account any position with “.” or “-” in the alignment indicating a gap, whereby a position with a particular frequency (e.g., >50%) of gaps is defined as a grey region in the signature.
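
The conversion of an alignment into a per-position signature can be illustrated with a small sketch. It follows only what the paragraph states: the non-redundant residues in each alignment column are the allowed amino acids, “.” and “-” count as gaps, and a column whose gap fraction exceeds a threshold (50% here) is marked grey. The function name, the output layout, and the toy alignment are assumptions; parsing of actual Pfam alignment files is not shown.

```python
GAP_CHARS = {".", "-"}


def build_signature(alignment, gap_fraction=0.5):
    """Turn aligned sequences into a per-column signature.

    alignment: list of equal-length strings (one aligned sequence each).
    Returns one dict per column with the sorted set of distinct residues
    ("allowed") and whether the column is mostly gaps ("grey").
    """
    signature = []
    for column in zip(*alignment):
        residues = sorted({aa for aa in column if aa not in GAP_CHARS})
        gaps = sum(1 for aa in column if aa in GAP_CHARS)
        signature.append({
            "allowed": residues,
            "grey": gaps / len(column) > gap_fraction,
        })
    return signature


# Toy alignment of four sequences over five columns (made up for the example).
alignment = ["MKV-L", "MRV-L", "MKI.L", "MKVAL"]
signature = build_signature(alignment)
for index, position in enumerate(signature):
    print(index, position["allowed"], "grey" if position["grey"] else "")
```

In the toy alignment the fourth column is three-quarters gaps and is therefore marked grey, while the second and third columns each admit two residues.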

In some embodiments, protein signature module 260-6 determines and displays the set of non-redundant AAs produced from the multiple sequence alignment (MSA), generating a unique signature of allowed AAs for every sequence position and showing each of the 20 AAs in a distinct color, and defines that the allowed and non-allowed regions of the positive-negative signature of a domain or protein determine the pathogenicity or deleteriousness of a variant by its occurrence in the positive (green) or negative (red) region. In some embodiments, protein signature module 260-6 displays the non-redundant AAs from the multiple sequence alignment (e.g., from the Pfam database) in one color (e.g., green), and all other AAs in another color (e.g., red), showing a map of allowed (positive) and non-allowed (negative) AA substitution space across the sequence, indicating variants that may result in a viable or defective protein. In some embodiments, protein signature module 260-6 finds that the deleterious (pathogenic) mutations would fall within the negative region (red) and that the benign or likely pathogenic mutations would fall within the positive region (green), applies this finding in testing and determining if a given variant is deleterious or not, determines the impact and clinical significance of the mutations based on the occurrence of the altered amino acids within the negative amino acid space or the positive amino acid space, thereby showing the amino acids where the actual mutations occur by color codes, and depicts the signature for the exon encoded domains in color codes based on a hydropathy scale. Protein signature module 260-6 displays the hydrophobic AAs in shades of a particular color (e.g., red) and the hydrophilic AAs in shades of another color (e.g., blue) to create a heat-map of hydropathy. 
In some embodiments, protein signature module 260-6 determines the secondary structure map of the amino acid signature using standard values of secondary structure, and depicts them in different color codes, thus creating a color-coded secondary structure signature, which will change due to genetic mutations from a subject or from gene-mutation databases such as ClinVar, dbSNP, and COSMIC. In some embodiments, protein signature module 260-6 defines the secondary structure map of the amino acid signature using standard values of secondary structure, depicting them in different color codes, thus creating a color-coded secondary structure signature and enabling its illustration against the domain signature for the analysis of secondary structures correlating with signatures and mutations in various amino acids. In some embodiments, protein signature module 260-6 enables the illustration, visualization, and analysis of mutations in the 3D structure of the domain along with the amino acid variability in the allowed or non-allowed set of amino acids, correlating and determining the effects of the mutations in the domain.
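
The positive-negative signature lends itself to a simple lookup, sketched below under stated assumptions. A substitution is placed in the positive (green) space when the new residue is among the allowed amino acids at that position and in the negative (red) space otherwise. The function name, the data layout, and the example positions are hypothetical, and the result is only a placement in the signature, not a clinical classification.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
REGION_COLORS = {"positive": "green", "negative": "red"}  # colors named in the text


def classify_substitution(allowed_by_position, position, new_aa):
    """Place a substitution in the positive or negative amino acid space.

    allowed_by_position: list of sets, one per domain position (0-based),
    holding the residues observed at that position in the alignment.
    """
    if new_aa not in AMINO_ACIDS:
        raise ValueError(f"not a standard amino acid: {new_aa!r}")
    region = "positive" if new_aa in allowed_by_position[position] else "negative"
    return region, REGION_COLORS[region]


# Hypothetical three-position signature.
allowed = [{"M"}, {"K", "R"}, {"I", "V"}]
print(classify_substitution(allowed, 1, "R"))  # residue seen in the alignment
print(classify_substitution(allowed, 1, "D"))  # residue not seen in the alignment
```

How grey (mostly gap) positions should be treated is not specified in the text, so the sketch leaves them out.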

In some embodiments, protein signature module 260-6 represents the structure of coding exons in a gene by a shape such as an oval or rectangle, overlaying the protein domains encoded by the exons, as available in the Pfam database or predicted by PfamScan, or any other amino acid alignment databases, correlating the clinical association of mutations in the CDS with cancers and non-cancer disorders in a user-driven approach, and displaying various details of domains encoded by the exons, such as domain identifier (PfamId), class, start and end position of the transcript encoding the domain, and coding exons, using i-icons, mouse hovers, and context-sensitive popups; depicts the variable amino acids in the key regions of human proteins, such as domains, deriving the set of “allowed” amino acids by generating the multiple sequence alignments of diverse genomes, creating a signature of potential amino acid substitutions across the domain, and classifying the signature under different tabs including 20 colors, Positive Negative, Hydro, Cryptic Splice, Alternative Splicing, and whole protein signature; and illustrates the alignment of amino acids under two different tabs, Seed and Full, depicting the alignment which contains a set of allowed/curated amino acids in the Seed tab and the alignment which contains the set of amino acids produced by Pfam using Hidden Markov models in the Full tab. 
In some embodiments, protein signature module 260-6 computes and depicts the signature of potential amino acid substitutions across the domain in color codes based on the hydropathy (hydrophobic and hydrophilic) index and the charge of amino acids, determining its region and impact on the cryptic and alternative splicing sites; creates and depicts the impression of the known amino acid substitutions or subject(s)' mutations that are likely to maintain the structure and function of a given protein region, and the mutations that are likely to destroy the structure and function of the protein, thus leading to disease; depicts the exons that encode a domain by overlaying the domains on the corresponding positions of the codon and AA sequences, and various features of domains and proteins against the gene sequence; and enables the selection of different score thresholds to view, in different color codes, any cryptic splice sites or cryptic exons that occur within the CDS of different exons, thereby identifying the cryptic splice sites and cryptic exons within real exons, whose mutations can disrupt normal splicing, leading to defective protein and disease.

In some embodiments, protein signature module 260-6 depicts, in different color codes, the positions on the signature in which the human amino acid sequence has a gap, shown as a dash in the human sequence, but other genomes have amino acids, and indicates the positions at which less or more than a specific fraction (e.g., 50%) of amino acids occur with or without a gap in the sequence signature. In some embodiments, protein signature module 260-6 provides toggle options to turn on the mutations to overlay known mutations on the signatures from different databases such as dbSNP, ClinVar, and COSMIC, categorized into clinical significance, molecular consequence, variation type, and pathogenicity based on the SIFT and/or PolyPhen scores, and enables the illustration of the amino acids, cryptic sites, and their scores in graphical, tabular, and sequence views with pop-up boxes, mouse hovers, and context-sensitive explanations. In some embodiments, protein signature module 260-6 analyzes cryptic sites and exons within the coding sequence of a protein by determining and depicting the cryptic splice sites and cryptic exons, real splice sites and exon positions, and their scores in various color codes and shapes, based on different score thresholds within the coding exon sequences, in tabular, graphical, and sequence illustrations; analyzes the alternative splicing of the exons coding for the domains and provides the signatures for the added or skipped region of the exons coding for the domain; and enables the pattern analysis of variations in protein and domain sequence signatures for different transcripts of a given gene.

In some embodiments, protein signature module 260-6 displays the number of samples for each variant from the COSMIC database for each domain position, and depicts positions with a single variant in one color (e.g., red) and positions with more than one variant in different colors, for example, as follows: two variants->blue, three variants->green, four variants->yellow (referred to as a variant density plot); predicts the splice sites in the genes of any organism using Shapiro & Senapathy and other relevant algorithms in an automated manner; predicts and assigns the score for cryptic exons based on the cryptic donor and acceptor splice site scores; and detects which amino acid mutation would make the protein defective based on the mutations from one or more subjects within the protein signature, based on where the mutation falls within the positive or negative amino acid space, determining which mutations are correctly identified and which are incorrectly identified. In some embodiments, protein signature module 260-6 overlays the subject(s)' mutations on the gene, and provides visual and analytical illustrations of the mutations from the subject(s) and known mutations from various gene-mutation databases in graphical, tabular, and sequence views with pop-up boxes, mouse hovers, and context-sensitive explanations; enables various search options using nested search boxes for the user to choose the genes based on the domain, number of domains in a gene, families, average AA substitutions, alignment type, disease associated genes, domains using Pfam Identifier, and exceptional genes; and provides various information about the gene and its associated elements, such as protein family and domains, ontology information, and disease phenotypes, using i-icons, mouse hovers, and context-sensitive popups.
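
The color assignment of the variant density plot can be sketched as a small mapping. The colors for one to four variants are the examples given above; everything else is an assumption made for illustration: the function name, the input layout of hypothetical `(position, alternate residue)` pairs, counting distinct variants per position, and the grey fallback for positions with more than four variants, which the text does not specify.

```python
from collections import defaultdict

# Example colors from the text: 1 -> red, 2 -> blue, 3 -> green, 4 -> yellow.
DENSITY_COLORS = {1: "red", 2: "blue", 3: "green", 4: "yellow"}


def variant_density(variants, fallback_color="grey"):
    """Count distinct variants per domain position and pick a color.

    variants: iterable of (position, alternate_residue) pairs.
    fallback_color is a hypothetical choice for counts above four.
    Returns {position: (distinct_variant_count, color)} ordered by position.
    """
    by_position = defaultdict(set)
    for position, alt in variants:
        by_position[position].add(alt)  # duplicates collapse in the set
    return {
        position: (len(alts), DENSITY_COLORS.get(len(alts), fallback_color))
        for position, alts in sorted(by_position.items())
    }


# Made-up variant calls; (10, "A") appears twice and is counted once.
variants = [(10, "A"), (10, "G"), (12, "T"), (15, "A"), (15, "C"), (15, "D"), (10, "A")]
density = variant_density(variants)
print(density)
```

Sample counts per variant, which the module also displays, could be tracked by replacing the set with a counter keyed on the alternate residue.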

In some embodiments, protein signature module 260-6 creates a repositoryfor the ProtSig platform containing information for genes in a genomesuch as exon details with the encoding domains, genomic position of theexons, transcript details, exon length, protein structure, real andcryptic splice sites and exons, and enabling the display and analysis ofthese features for any selected gene, enabling the search option forgenes from various gene panels such as disease panels, drug metabolizinggene (DMG) panel, the American College of Medical Genetics and Genomics(ACMG) gene panel, and other user given gene panels and enabling thedisplay and analysis of any gene by a query, and identifying structuralregions and allowed nucleotide variations as signatures, and themutations and disease relationships, in non-coding RNA genes (e.g.,tRNA, rRNA, miRNA, snoRNA, siRNA).

In some embodiments, protein signature module 260-6 enables theuser-guide of the platform such as the “About” that automaticallyprovides context sensitive explanations for various features andapplications, and “How To” that automatically provides context sensitiveinformation of how to use particular features throughout the differentsections of the platform, provides and analyzes statistics for genes invarious organisms, and displays various statistics and information suchas the number of unique domains from Pfam in multiple genomes, number ofprotein isoforms with different numbers of domains, average number ofdomains per protein across the proteome, average domain signaturecharacteristics, average number of exons across the genome, and enablesthe use of tightly coupled navigation by interlinking different sectionsto provide analysis of a gene, protein, or other elements and featuresthroughout the platform.

In some embodiments, protein signature module 260-6 represents theconsequences of mutation in these structures of the transcripts and thegene with graphical, tabular, and sequence illustrations, and plottingsubject mutations, and the known mutations from different databases suchas dbSNP, ClinVar, and COSMIC, and categorized into clinicalsignificance, molecular consequence, variation type, and pathogenicitybased on the SIFT and/or PolyPhen scores, overlays the subject(s)′mutation(s) on the gene and protein structure and sequence on which thereal and cryptic splice site and exon mutations are depicted, anddetermines the connection to various diseases, and enables an expandedversion for each of the features of ProtSig, which allows users tovisualize and analyze further details in graphical, tabular, andsequence illustrations. In some embodiments, protein signature module260-6 automatically updates various data from different databases suchas NCBI, ENSEMBL, and Pfam, and including other databases, and presentsthe latest information on different features such as genes, proteins,domains, mutations, and diseases, and is applicable for differentorganisms, including the human, other animals, microbial organisms, andplants.

UTR view module 260-7 identifies the various promoter elements, 5′ and 3′ UTRs, poly-A sites, and various possible ORFs such as u-ORFs and d-ORFs, their sub-classifications based on the specific start and stop codons, and their disease connections.

In some embodiments, UTR view module 260-7 identifies genetic elementsin various tabs for analyzing the properties of promoters and UTRs intranscripts and mRNAs such as: mRNA sequence, splice score and promoter,displays the structure of mRNA transcript of a gene and illustrating andenabling the analysis of the properties of un-translated regions (UTRs)in human mRNA sequences, and enables the classification of exons in thetranscript into coding, partially-coding, or non-coding exons, providingsplice site sequences, and scores for each of them.

In some embodiments, UTR view module 260-7 locates any upstream anddownstream open reading frames (u-ORFs and d-ORFs) that surround thereal ORF (CDS), enables the determination of the Kozak consensussequences surrounding the start codon, and providing Kozak scores forthe identified ORFs in upstream and downstream regions, indicating whichORFs may be turned on in different biological contexts, and depicts thestructure and sequence of mRNAs and locates the sequence components suchas coding sequence, 5′/3′ UTRs, Poly-A signals, initiator ATG codons,stop codons that are in-frame with one or more ATGs, upstream ORFs(u-ORFs) and downstream ORFs (d-ORFs), and displays four differentclasses of ORFs in upstream and downstream regions of every mRNAtranscript of genes, in tabular, graphical, and sequence views.
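A minimal sketch of locating upstream and downstream ORFs around the real ORF, with a simple Kozak score, is given below. It is an illustration under assumptions rather than the platform's method: the Kozak score here is the fraction of positions matching the commonly cited consensus GCCRCCATGG, the coordinates are 0-based and half-open, only three classes (u-ORF, r-ORF, d-ORF) are assigned, and all function names and the toy mRNA are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
KOZAK_CONSENSUS = "GCCRCCATGG"  # R = A or G; the ATG occupies positions 6-8


def find_orfs(mrna):
    """Return (start, end) for every ATG that reaches an in-frame stop codon."""
    orfs = []
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] != "ATG":
            continue
        for i in range(start + 3, len(mrna) - 2, 3):
            if mrna[i:i + 3] in STOP_CODONS:
                orfs.append((start, i + 3))  # end is exclusive, stop included
                break
    return orfs


def kozak_score(mrna, start):
    """Fraction of consensus positions matched around the ATG at `start`."""
    begin = start - 6
    if begin < 0 or begin + len(KOZAK_CONSENSUS) > len(mrna):
        return None  # not enough flanking sequence to score
    window = mrna[begin:begin + len(KOZAK_CONSENSUS)]
    hits = sum(b == c or (c == "R" and b in "AG")
               for b, c in zip(window, KOZAK_CONSENSUS))
    return hits / len(KOZAK_CONSENSUS)


def classify_orfs(mrna, cds_start, cds_end):
    """Label each ORF relative to the annotated CDS [cds_start, cds_end)."""
    rows = []
    for start, end in find_orfs(mrna):
        if start == cds_start:
            kind = "r-ORF"
        elif start < cds_start:
            kind = "u-ORF"
        elif start >= cds_end:
            kind = "d-ORF"
        else:
            continue  # ATGs inside the CDS are not classified in this sketch
        rows.append((kind, start, end, kozak_score(mrna, start)))
    return rows


# Toy mRNA: a u-ORF, a CDS with a full Kozak context, and a d-ORF.
toy_mrna = "ATGAAATAA" + "GCCACC" + "ATGGCTTGA" + "CC" + "ATGTTTTAG"
orf_table = classify_orfs(toy_mrna, cds_start=15, cds_end=24)
```

Each row of `orf_table` corresponds to one line of a tabular view (ORF type, position, and Kozak score); a production scorer would more likely weight the −3 and +4 positions than count matches equally.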

In some embodiments, UTR view module 260-7 illustrates different ORF classes such as u-ORF, r-ORF (real open reading frame), and d-ORF between the 5′ and 3′ regions of coding exons, and depicts the occurrences of start and stop codons on the gene's mRNA and for every ORF class in graphical and sequence views. In some embodiments, UTR view module 260-7 determines the ORF classes and tabulates their features, such as ORF type, ORF position, Kozak sequence, Kozak score, stop codon sequence, real stop codon score, and 4-base stop codon score, and illustrates them in graphical and sequence views. In some embodiments, UTR view module 260-7 displays the splice sites for all the exons in a transcript, computes their scores using the Shapiro & Senapathy algorithm and other relevant algorithms, and calculates and displays the exon scores by taking the average of the acceptor and donor scores. In some embodiments, UTR view module 260-7 defines different UTR and exon classes in a transcript, and categorizes them as fully coding exon (FCE), 5′ partially-coding exon (PCE5), 3′ partially-coding exon (PCE3), 5′ and 3′ partially-coding exon (PCE53), 5′ non-coding exon (NCE5), and 3′ non-coding exon (NCE3).
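The exon score and the exon classes can be sketched as follows. The exon score follows the description directly (the average of the acceptor and donor scores). The classification rule is an assumption made for illustration: an exon is treated as 5′ partially-coding when it contains both 5′ UTR and coding sequence, and analogously for the 3′ side. The labels are written here as PCE5 and NCE5 for the 5′ classes, and coordinates are transcript-oriented, 0-based, and half-open.

```python
def exon_score(acceptor_score, donor_score):
    """Exon score as the average of its acceptor and donor splice site scores."""
    return (acceptor_score + donor_score) / 2


def classify_exon(exon_start, exon_end, cds_start, cds_end):
    """Classify an exon [exon_start, exon_end) against the CDS [cds_start, cds_end)."""
    if exon_end <= cds_start:
        return "NCE5"   # lies entirely in the 5' UTR
    if exon_start >= cds_end:
        return "NCE3"   # lies entirely in the 3' UTR
    has_5utr = exon_start < cds_start
    has_3utr = exon_end > cds_end
    if has_5utr and has_3utr:
        return "PCE53"  # coding, with UTR sequence on both sides
    if has_5utr:
        return "PCE5"
    if has_3utr:
        return "PCE3"
    return "FCE"        # fully coding


# Hypothetical five-exon transcript with the CDS spanning positions 80-230.
exons = [(0, 50), (50, 120), (120, 200), (200, 260), (260, 300)]
classes = [classify_exon(s, e, cds_start=80, cds_end=230) for s, e in exons]
```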

In some embodiments, UTR view module 260-7 identifies the promoter boxessuch as TATA, CAAT, GC, and transcription initiators in the gene bycomputing the scores with varying thresholds by adapting the Shapiro &Senapathy and other relevant algorithms for each of the identifiedpromoter boxes, enabling toggle options to visualize the variouspromoter boxes in graphical illustrations of gene structure andsequence. In some embodiments, UTR view module 260-7 predicts theclinical consequences of the subject's mutations in promoter boxes andpoly-A sites, and UTR elements in the gene graphically and in sequenceillustrations, determining their pathogenicity based on mutated scores,correlating with disease, and conducts similar analyses for the knownmutations from the different disease-gene databases such as dbSNP,ClinVar, and COSMIC, and categorized into clinical significance,molecular consequence, variation type, and pathogenicity based on theSIFT and/or PolyPhen scores. In some embodiments, UTR view module 260-7computes the strong poly-A signals and depicts them on the gene and mRNAin tabular, graphical, and sequence illustrations, and the disruptivemutations based on the scores, uses the Shapiro & Senapathy algorithm inthe identification of different elements of the promoter (boxes), poly-Asites, and UTR classes for a gene, and their cryptic versions, andidentifies real and cryptic promoter and poly-A motifs and elements byadapting and modifying other relevant algorithms such as MaxEntScan,NNSplice, and Human splicing Finder throughout the gene sequence andgenes in the genome and its application to subject and cohort genomics.

In some embodiments, UTR view module 260-7 identifies real and crypticpromoters and poly-A motifs and elements by adapting and modifying otherrelevant algorithms such as MaxEntScan, NNSplice, and Human splicingFinder throughout the gene sequence and genes in the genome. In someembodiments, UTR view module 260-7 identifies real and cryptic splicesites using promoter and poly-A motifs and elements by adapting andmodifying other relevant algorithms such as MaxEntScan, NNSplice, andHuman splicing Finder throughout the gene sequences and genes in thegenome and identifying the known mutations from databases such asClinVar, dbSNP, and COSMIC. In some embodiments, UTR view module 260-7identifies real and cryptic promoter and poly-A motifs and elements byadapting and modifying other relevant algorithms such as MaxEntScan,NNSplice, and Human splicing Finder throughout the gene sequences andgenes in the genome and identifying the mutations from subjects' genome.In some embodiments, UTR view module 260-7 enables various searchoptions using nested search boxes for the user to choose the genes basedon number of ORFs, number of promoter boxes, promoter box score, poly-Aboxes, poly-A box score, exon classes, disease associated genes,exceptional genes, and other parameters.

In some embodiments, UTR view module 260-7 provides various informationfor elements such as exons, mRNA elements, promoter elements, and UTRelements in a gene and their associated elements such as protein familyand domains, ontology information, disease phenotypes using i-icons,mouse hovers, and context-sensitive popups. In some embodiments, UTRview module 260-7 enables the search option for genes from various genepanels such as disease panels, drug metabolizing gene (DMG) panels, theAmerican College of Medical Genetics and Genomics (ACMG) gene panels,and other user given gene panels and enabling the display and analysisof these genes on UTR view platform. In some embodiments, UTR viewmodule 260-7 identifies and illustrates the exceptional gene exons withrare behaviors such as an in-frame stop codon, selenocysteine codon, orno stop codons present in the end of the CDS, and applies to non-codingRNA genes (e.g., tRNA, rRNA, miRNA, snoRNA, siRNA, lncRNA).

In some embodiments, UTR view module 260-7 enables the user-guide of theplatform such as the “About” that automatically provides contextsensitive explanations for various features and applications, and “HowTo” that automatically provides context sensitive information of how touse particular features throughout the different sections of theplatform. In some embodiments, UTR view module 260-7 provides the UTRview statistics analyzed for all the genes in various organisms anddisplays the information on frequency of different elements such aspromoter boxes, poly-A sites, and exons contained in coding andnon-coding regions, several different classes of ORFs (u-ORFs andd-ORFs), average Kozak and 4-base stop codon scores from the differentORF classes, and distribution of real and false 4-base stop codons. Insome embodiments, UTR view module 260-7 enables the use of tightlycoupled navigation by interlinking different sections to provideanalysis of a gene, protein, or other elements and features throughoutthe UTR view platform, updates latest data and information pertaining tothe elements described in UTR view with increasing data sources. In someembodiments, UTR view module 260-7 applies to all organisms includinghuman, other animals, plants, and microbial organisms, enables thedepiction of cryptic splice sites on the 5′ and 3′ UTR regions using theShapiro & Senapathy and other relevant algorithms, and analyzes subjectmutations and known mutations from databases such as ClinVar, dbSNP, andCOSMIC, overlaid on the genes to correlate their involvement in disease.

In some embodiments, UTR view module 260-7 identifies new promotermotifs and elements based on PWM methods, sliding window methods, motifsearch methods, methods using motif sequence lookups, and sequencealignment methods in long sequences up to more than 10,000 basesupstream of gene start. In some embodiments, UTR view module 260-7identifies motifs and elements that are target(s) of sequence specificpromoter binding proteins and genes (such as TP53, OBSCN, TAF3, andFAT3) based on PWM methods, sliding window methods, motif searchmethods, methods using motif sequence lookups, and sequence alignmentmethods in long sequences up to more than 10,000 bases upstream of genestart. In some embodiments, UTR view module 260-7 identifies new genecontrol motifs and elements including promoter silencers and enhancers,based on PWM methods, sliding window methods, motif search methods,methods using motif sequence lookups, and sequence alignment methods inlong sequences up to more than 10,000 bases upstream of gene start. Insome embodiments, UTR view module 260-7 identifies new poly-A sitemotifs and poly-A site recognition motifs based on PWM methods, slidingwindow methods, motif search methods, methods using motif sequencelookups, and sequence alignment methods in long sequences up to morethan 10,000 bases downstream of CDS end and gene end.

In some embodiments, UTR view module 260-7 identifies the promoter motifs by combining each of the shorter promoter elements such as TATA, CAAT, and GC and, in addition, the transcription start site (TSS), calculates a promoter motif score by combining the scores of individual promoter elements such as TATA, CAAT, and GC, thereby defining the strength of the promoter, and defines other transcriptional regulating elements such as enhancers and silencers and determines their combined scores. In some embodiments, UTR view module 260-7 determines the poly-A motifs by combining additional signals such as T/GT-rich downstream sequence elements, T-rich upstream sequence elements, G-rich auxiliary downstream elements, and TGTA elements, calculates a poly-A motif score by combining the scores of each of these elements, and identifies and analyzes the mutations in these promoter and poly-A motifs and their implications in disease causation. In some embodiments, UTR view module 260-7 identifies subject mutations in potential gene control elements in long sequences up to more than 10,000 bases upstream of gene start and downstream of gene end, using tools provided within Genome Explorer and Splice Atlas.
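The description does not state how the individual element scores are combined into a motif score. Purely as an illustration, the sketch below assumes a weighted mean of the per-element scores; the function name, the optional weights, and the example scores are all hypothetical.

```python
def combined_motif_score(element_scores, weights=None):
    """Combine per-element scores into one motif score (assumed: weighted mean).

    element_scores: mapping of element name to its score, e.g. {"TATA": 90.0}.
    weights: optional mapping of element name to weight; missing names weigh 1.0.
    """
    if not element_scores:
        raise ValueError("at least one element score is required")
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in element_scores)
    weighted_sum = sum(score * weights.get(name, 1.0)
                       for name, score in element_scores.items())
    return weighted_sum / total_weight


# Hypothetical element scores for one promoter.
promoter_scores = {"TATA": 90.0, "CAAT": 70.0, "GC": 80.0}
plain = combined_motif_score(promoter_scores)
tata_weighted = combined_motif_score(promoter_scores, weights={"TATA": 2.0})
```

The same helper could be applied to poly-A motif elements (for example the T/GT-rich, T-rich, G-rich, and TGTA elements) once each has its own score.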

BPS view module 260-8 predicts, illustrates, and analyzes branch point sites (BPSs) in one or multiple genes from a genome, and identifies the mutations from subjects or known mutations within BPSs and their correlation with cancers and other diseases. In some embodiments, BPS view module 260-8 uses algorithm 250 (e.g., the Shapiro & Senapathy algorithm and other relevant algorithms) in genes in a genome, and represents BPSs on the gene and genomic scale of individual subjects and cohorts of subjects. In some embodiments, BPS view module 260-8 enables the discovery of frequently mutated genes within the BPS from subjects, and correlates the molecular details of the structure/function and aberrations in these genes with the phenotypes, traits, and disease or drug responses.

In some embodiments, BPS view module 260-8 predicts the splicingalterations and aberrations and their effect in splicing the transcriptresulting in a defective protein based on the mutations within thebranch point regions in a gene, and displays the branch point mutationsand their effects on splicing (e.g., intron retention, exon skipping,and cryptic exons inclusions) in the transcripts, mRNA and protein withgraphical, sequence, and tabular illustrations of the gene, RNA andprotein from subjects. In some embodiments, BPS view module 260-8displays the frequency of branch point mutations in genes in the genomein one view, and their effects on splicing (e.g., intron retention, exonskipping, and cryptic exons inclusions) in the transcripts, mRNA andprotein with graphical, sequence, and tabular illustrations of the gene,RNA and protein from subjects. In some embodiments, BPS view module260-8 identifies cryptic branch point sequences within exons and intronsand throughout the gene sequence, and in genes across the genome, withvarying range of score thresholds calculated based on the Shapiro &Senapathy algorithm and other available algorithms, and depicting themgraphically on the gene structure, sequence, and tabular illustrations.In some embodiments, BPS view module 260-8 displays the branch pointmutations in one or more genes individually or on the genome-scale inone view, from the mutation databases such as dbSNP, ClinVar, andCOSMIC, and their effects on splicing (e.g., intron retention, exonskipping, cryptic exons inclusions) in the transcripts, mRNA and proteinwith graphical, sequence, and tabular illustrations of gene, RNA, andprotein from subjects.

In some embodiments, BPS view module 260-8 displays the mutations in thecryptic branch point sites on one or more genes individually or on thegenome-scale in one view, from the variant databases such as dbSNP,ClinVar, and COSMIC, and their effects on splicing (e.g., intronretention, exon skipping, cryptic exons inclusions) in the transcripts,mRNA and protein with graphical, sequence, and tabular illustrations ofgene, RNA, and protein from a subject or cohort of subjects. In someembodiments, BPS view module 260-8 enables the discovery of frequentlymutated genes within the cryptic branch point regions from subjects, andcorrelating the molecular details of the structure/function andaberrations in these genes with the phenotypes, traits, and disease ordrug response, analyzes branch point mutations from a subject, overlaidon the genes to correlate their involvement in disease, and enables theidentification, visualization, and deeper analysis of branch points andother regulatory elements and their cryptic versions, individually andin combinations, in a single application. In some embodiments, BPS viewmodule 260-8 builds sub-PWMs for the non-canonical branch pointssurrounding the first downstream base from the 3′ intron end, enablesthe visualization and deeper analysis of branch points and otherregulatory elements and their cryptic versions, individually and incombinations, in a single application, enables the analysis of BPS fromsingle subjects or cohort of subjects, and enables the analysis of BPSand its combinations with different coding and regulatory elements fromsingle subjects or cohort of subjects in a single application.

In some embodiments, BPS view module 260-8 predicts, illustrates, andanalyzes a BPS in one, multiple, or genes from the genomes of organismsincluding the human, other animals, plants, and eukaryotic microbialorganisms. In some embodiments, BPS view module 260-8 enables theuser-guide of the platform such as the “About” that automaticallyprovides context sensitive explanations for various features andapplications, and “How To” that automatically provides context sensitiveinformation of how to use particular features throughout the differentsections of the platform. In some embodiments, BPS view module 260-8provides statistics analysis for the genes in various organisms anddisplaying various statistics and information across the genome, enablesthe use of tightly coupled navigation by interlinking different sectionsto provide analysis of a gene, protein, or other elements and featuresthroughout the platform, and enables an expanded version for each of thefeatures of BPS view module 260-8, which allows users to visualize andanalyze further details in graphical, tabular, and sequenceillustrations.

Regulatory module 260-9 identifies promoter enhancer and silencer regions at a distance from the promoter, at the 5′ or 3′ sides of the gene or within exons and introns, or at remote locations on the same or other chromosomes. Enhancers and silencers of polyadenylation signals are also found at the 5′ or 3′ sides of the gene or within exons and introns. Enhancers and silencers of splicing are found within exons and introns and other regions of the gene. Regulators of trans-splicing may occur remotely on the same or other chromosomes. In some embodiments, regulatory module 260-9 identifies short sequence motifs that contain binding sites for transcription factors and other binding proteins, which activate their target genes by binding to specific sequences. In some embodiments, regulatory module 260-9 identifies silencers that suppress the gene expression, splicing, or other processes. Although the enhancer DNA may be far from the gene in a linear way, it may be spatially close to the promoter and gene. This allows the enhancer sequence to interact with the general transcription factors and RNA polymerase II. The same mechanism holds true for silencers. Silencers are antagonists of enhancers that, when bound to their proper transcription factors, called repressors, repress the transcription of the gene. In some embodiments, regulatory module 260-9 identifies an enhancer located within several hundred thousand bases upstream or downstream of the gene it regulates. Enhancers act by binding to activator proteins and not on the promoter regions. These activator proteins interact with the mediator complex, which recruits polymerase II and the general transcription factors, which then begin transcribing the genes. Enhancers can also be found within introns. In addition, enhancers can be found at the exonic region of an unrelated gene, and may act on genes on another chromosome. In some embodiments, regulatory module 260-9 identifies the trans-acting splicing activator and splicing repressor proteins as well as cis-acting elements within the pre-mRNA itself such as enhancers and silencers. These sequences are located within both exons and introns and either enhance or suppress splicing. In some embodiments, regulatory module 260-9 identifies exonic splicing enhancers (ESEs) and intronic splicing enhancers (ISEs), which activate or enhance the splicing process from within exons and introns, respectively, and exonic splicing silencers (ESSs) and intronic splicing silencers (ISSs), which suppress the splicing process from within exons and introns, respectively.

In some embodiments, regulatory module 260-9 identifies cis-regulatory elements, i.e., exonic and intronic splicing enhancers (ESE and ISE, respectively) and exonic and intronic splicing silencers (ESS and ISS, respectively), by recognizing specific splicing repressors and activators (trans-acting elements) that help to properly carry out the splicing process. In some embodiments, regulatory module 260-9 identifies splicing enhancers to which splicing activator proteins bind, increasing the probability that a nearby site will be used as a splice junction. These also may occur in the intron (intronic splicing enhancers, ISE) or exon (exonic splicing enhancers, ESE). In some embodiments, regulatory module 260-9 identifies an exonic splicing enhancer (ESE) consisting of ~6 bases within an exon that enhances accurate splicing of pre-mRNA into the mRNA. Most of the activator proteins that bind to ISEs and ESEs are members of the SR protein family. Such proteins contain RNA recognition motifs and arginine and serine-rich (RS) domains. In some embodiments, regulatory module 260-9 identifies splicing silencers to which splicing repressor proteins bind, reducing the probability that a nearby site will be used as a splice junction. These can be located in the intron itself (intronic splicing silencers, ISS) or in a neighboring exon (exonic splicing silencers, ESS). An ESS is a short region (usually 4-18 nucleotides) of an exon, which inhibits or silences splicing of the pre-mRNA and contributes to constitutive and alternative splicing. The majority of splicing repressors are heterogeneous nuclear ribonucleoproteins (hnRNPs) such as hnRNPA1 and polypyrimidine tract binding protein (PTB).

In some embodiments, regulatory module 260-9 identifies point mutations in exons that inactivate an ESE or create an ESS, which in turn can lead to alternative events like exon skipping and eventually a truncated protein resulting in genetic disorders. Mutations in these regions are of very high significance as these are implicated in numerous cancers and non-cancer disorders. Also, the adaptive significance of splicing silencers and enhancers is further attested by multiple studies showing that there is a strong selection in human genes against mutations that produce new silencers or disrupt existing enhancers. In some embodiments, regulatory module 260-9 identifies cryptic enhancers and silencers, which also have great impact on gene expression, splicing, and translation. These cryptic regulators may be present anywhere in the genome and affect the gene expression and splicing on a large scale on account of mutational aberrations within them. Mutations in the cryptic sites may increase their scores (calculated using modified Shapiro & Senapathy algorithms and other algorithms), which may lead to suppression of gene expression or regulation of unwanted genes.

In some embodiments, regulatory module 260-9 creates a map to enable theprediction, illustration, and analysis of enhancers and silencers andtheir cryptic versions, and their mutational aberrations, employing themodified Shapiro & Senapathy algorithm and other relevant algorithms inany genomes including human and other organisms. In addition, itprovides a platform for predicting and analyzing the effects of knownmutations in these regulatory elements as well as mutations fromindividual subjects' genomes and the genomes from subject cohorts. Insome embodiments, regulatory module 260-9 identifies regulators oftrans-acting elements for gene regulation and splicing that may occurremotely on the same or other chromosomes. Splice Atlas identifies thecis-acting enhancer and silencer motifs and elements, their crypticversions, and mutations, based on several methodologies throughout thegene, and trans-acting enhancer and silencer motifs and elements usingsimilar methods at remote locations on the same or differentchromosomes. In addition, Splice Atlas also identifies the crypticversions of regulatory and splicing elements and mutations within them.

In some embodiments, regulatory module 260-9 identifies and illustrates mutations from third-party databases such as dbSNP, ClinVar, and COSMIC, which are retrieved and overlaid on these enhancer and silencer sites. In addition, mutations from the individual subjects' genome and from a cohort of subjects are also identified and plotted over the gene plot. Enhancers and silencers for polyadenylation sites are also determined using similar methods. The details of elements including the gene regulating enhancers and silencers, splicing enhancers and silencers, and their cryptic forms are illustrated on gene plots in compact and expanded views, tabular forms, and detailed sequence views. Mutations in these elements and the molecular details of aberrations are also illustrated and enabled for interpretation and analysis. In some embodiments, regulatory module 260-9 provides a map of enhancers and silencers for predicting, illustrating, and analyzing the regulatory elements in genes, identifies and analyzes the mutations from subjects in these elements, and correlates them with clinical impacts. In some embodiments, regulatory module 260-9 identifies the exon and intron splicing enhancers (ESEs & ISEs) or silencers (ESSs & ISSs) by adapting the modified Shapiro & Senapathy algorithm and other relevant algorithms in genes in a genome, and represents them on the gene and genomic scale of individual subjects and cohorts of subjects. In some embodiments, regulatory module 260-9 identifies known mutations from sources such as dbSNP, ClinVar, and COSMIC, in the splicing enhancers (ESEs & ISEs) or silencers (ESSs & ISSs) in the genes of subjects and in cohorts of subjects, and analyzes them in correlation with various diseases.

In some embodiments, regulatory module 260-9 identifies frequentlymutated genes within the splicing enhancers or silencer regions from anindividual or cohort of subjects, and correlating the molecular detailsof the structure/function and aberrations in these genes with thephenotypes, traits, or drug responses, and identifies mutations in thesplicing enhancers or silencers responsible for aberrations involved inadverse drug reactions and affecting the efficacy of varied drugs in asubject. In some embodiments, regulatory module 260-9 displays themutations in the splicing enhancer or silencer sites on one or moregenes individually or on the genome-scale in one view, from the variantdatabases such as dbSNP, ClinVar, and COSMIC, and their effects onsplicing (e.g., intron retention, exon skipping, cryptic exonsinclusions) in the transcripts, mRNA and protein with graphical,sequence, and tabular illustrations of gene, RNA, and protein from asubject or cohort of subjects. In some embodiments, regulatory module260-9 displays mutations in the cryptic splicing enhancer and silencersequences on one or more genes individually or on the genome-scale inone view, from the variant databases such as dbSNP, ClinVar, and COSMIC,and their effects on splicing (e.g., intron retention, exon skipping,cryptic exons inclusions) in the transcripts, mRNA and protein withgraphical, sequence, and tabular illustrations of gene, RNA, and proteinfrom subjects. In some embodiments, regulatory module 260-9 providesinteractive visualizations and analytical capabilities for focusing onthe mutations in various splicing enhancers or silencer regionsindividually on gene structures and on a genomic scale, and facilitatingthe ability to perform analysis on enhancers and silencers acrosssubjects.

In some embodiments, regulatory module 260-9 identifies cryptic splicingenhancer and silencer sequences within exons and introns and throughoutthe genes across the genome, with varying range of score thresholdscalculated based on the modified Shapiro & Senapathy algorithm and otheravailable algorithms, and depicting them graphically on the genestructure, sequence, and tabular illustrations. In some embodiments,regulatory module 260-9 identifies the mutations from subjects on thecryptic splicing enhancer and silencer sequences, and their effects onsplicing (e.g., intron retention, exon skipping, and cryptic exonsinclusions) in the transcripts, mRNA and protein structures andfunctions. In some embodiments, regulatory module 260-9 predicts andidentifies regulators of trans-splicing that may occur remotely on thesame or other chromosomes, identifies cis-acting enhancer and silencermotifs and elements based on PWM methods, uses sliding window methods,motif search methods, methods using motif sequence lookups, and sequencealignment methods throughout the gene, and identifies trans-actingenhancer and silencer motifs and elements based on PWM methods, slidingwindow methods, motif search methods, methods using motif sequencelookups, and sequence alignment methods throughout the gene on remotelocations on the same or different chromosomes.

In some embodiments, regulatory module 260-9 analyzes enhancers andsilencers in gene expression, splicing and translation, and theircryptic versions, and its combinations with different coding andregulatory elements from single subjects or cohort of subjects in asingle application. In some embodiments, regulatory module 260-9includes a platform for predicting, illustrating, and analyzing theenhancer and silencer sequences in one, multiple, or genes from thegenomes of organisms including the human, other animals, plants, andeukaryotic microbial organisms. In some embodiments, regulatory module260-9 enables a user-guide of the platform such as the “About” thatautomatically provides context sensitive explanations for variousfeatures and applications, and “How To” that automatically providescontext sensitive information of how to use particular featuresthroughout the different sections of the platform. In some embodiments,regulatory module 260-9 provides statistics analysis for all the genesin various organisms and displays various statistics and informationacross the genome, enables the use of tightly coupled navigation byinterlinking different sections to provide analysis of a gene, protein,or other elements and features throughout the platform, and enables anexpanded version for each of the features of enhancer/silencer view,which allows users to visualize and analyze further details ingraphical, tabular, and sequence illustrations.

In some embodiments, regulatory module 260-9 includes a modified version of the Shapiro & Senapathy algorithm. In some embodiments, regulatory module 260-9 includes modified versions of other splicing algorithms such as MaxEntScan, NNSplice, and Human Splicing Finder. In some embodiments, regulatory module 260-9 detects each of the different regulatory elements (promoter boxes such as TATA box, CAT box, and GC box; promoter element; transcription initiator; branch point; exon splice enhancer and silencer (ESE & ESS); intron splice enhancer and silencer (ISE & ISS); and poly-A site) based on the specific position weight matrix (PWM) derived from the respective consensus sequence frequencies and sequence length of each regulatory element. In some embodiments, regulatory module 260-9 detects the cryptic versions of each of the different regulatory elements (promoter boxes such as TATA box, CAT box, and GC box; promoter element; transcription initiator; branch point; exon splice enhancer and silencer (ESE & ESS); intron splice enhancer and silencer (ISE & ISS); and poly-A site). In some embodiments, regulatory module 260-9 detects the different regulatory elements, and their cryptic versions, throughout the gene, and throughout multiple genes or all genes within a genome.
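A generic way to derive a position weight matrix from aligned example sites and to scan a sequence with a sliding window is sketched below. This is a standard log-odds PWM with pseudocounts and a uniform background, offered only as an illustration; it is not the Shapiro & Senapathy scoring formula or its modified version, and the example TATA-like sites, the threshold, and the function names are hypothetical. Sequences are assumed to contain only A, C, G, and T.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background)
                    for base in BASES})
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds for one window of the PWM's length."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Slide the PWM over the sequence and report windows scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = score_window(pwm, window)
        if score >= threshold:
            hits.append((start, window, round(score, 3)))
    return hits


# Hypothetical training sites for a TATA-like element.
tata_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAG"]
tata_pwm = build_pwm(tata_sites)
tata_hits = scan(tata_pwm, "GGCTATAAAGGC", threshold=4.0)
```

Lowering the threshold reports weaker matches as well, which is one simple way to surface candidate cryptic versions of an element alongside the strong sites.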

In some embodiments, regulatory module 260-9 detects the differentregulatory elements, and their cryptic versions, throughout the exonswithin a gene, throughout multiple genes or all genes within a genome,detects the different regulatory elements, and their cryptic versions,throughout the introns within a gene, throughout multiple genes or allgenes within a genome, and detects the different regulatory elements,and their cryptic versions, throughout the un-transcribed (promoter andupstream, poly-A site and downstream) and un-translated regions (5′ and3′ UTR) within a gene, throughout multiple genes or all genes within agenome. In some embodiments, regulatory module 260-9 detects thedifferent regulatory elements, and their cryptic versions, throughoutthe intergenic regions of a genome, detects cryptic exons, throughoutthe exons within a gene, throughout multiple genes or all genes within agenome, detects cryptic exons throughout the un-transcribed (promoterand upstream, poly-A site and downstream) and un-translated regions (5′and 3′ UTR) within a gene, throughout multiple genes or all genes withina genome. In some embodiments, regulatory module 260-9 detects crypticexons throughout the introns within a gene, throughout multiple genes orall genes within a genome, detects cryptic exons throughout theintergenic regions of a genome, and identifies deleterious mutationswithin splice sites to detect the deleterious mutations within each ofthe different regulatory elements (promoter boxes—TATA box, CAT box, GCbox—, promoter element, transcription initiator, branch point, exonsplice enhancer and silencer—ESE & ESS—, intron splice enhancer andsilencer—ISE & ISS—, poly-A site) based on the specific position weightmatrix (PWM) derived from the respective consensus sequence frequenciesand sequence length of each regulatory element.

In some embodiments, regulatory module 260-9 detects deleteriousmutations within each of the different regulatory elements (promoterboxes: TATA box, CAT box, GC box), promoter element, transcriptioninitiator, branch point, exon splice enhancer and silencer (ESE & ESS),intron splice enhancer and silencer (ISE & ISS), and poly-A site, anddetects deleterious mutations within the cryptic versions of each of thedifferent regulatory elements (promoter boxes (TATA box, CAT box, GCbox), promoter element, transcription initiator, branch point, exonsplice enhancer and silencer (ESE & ESS), intron splice enhancer andsilencer (ISE & ISS), and poly-A site). In some embodiments, regulatorymodule 260-9 detects deleterious mutations within the differentregulatory elements, and their cryptic versions, throughout the gene,and throughout multiple genes or all genes within a genome, detectsdeleterious mutations within the different regulatory elements, andtheir cryptic versions, throughout the exons within a gene, throughoutmultiple genes or all genes within a genome, and detects deleteriousmutations within the different regulatory elements, and their crypticversions, throughout the introns within a gene, throughout multiplegenes or all genes within a genome.

In some embodiments, regulatory module 260-9 detects deleterious mutations within the different regulatory elements, and their cryptic versions, throughout the un-transcribed (promoter and upstream, poly-A site and downstream) and un-translated regions (5′ and 3′ UTR) within a gene, throughout multiple genes or all genes within a genome, detects deleterious mutations within the different regulatory elements, and their cryptic versions, throughout the intergenic regions of a genome, detects deleterious mutations within cryptic exons throughout the exons within a gene, throughout multiple genes or all genes within a genome, and detects deleterious mutations within cryptic exons throughout the introns within a gene, throughout multiple genes or all genes within a genome. In some embodiments, regulatory module 260-9 detects deleterious mutations within cryptic exons throughout the un-transcribed (promoter and upstream, poly-A site and downstream) and un-translated regions (5′ and 3′ UTR) within a gene, throughout multiple genes or all genes within a genome, detects deleterious mutations within cryptic exons throughout the intergenic regions of a genome, finds mutations within the new genes discovered within the introns and intergenic regions, and applies splice site identification algorithms (such as MaxEntScan, NNSplice, and Human Splicing Finder) to detect each of the different regulatory elements (promoter boxes such as the TATA box, CAT box, and GC box; promoter element; transcription initiator; branch point; exon splice enhancer and silencer (ESE & ESS); intron splice enhancer and silencer (ISE & ISS); and poly-A site) based on the specific position weight matrix (PWM) derived from the respective consensus sequence frequencies and sequence length of each regulatory element.

In some embodiments, ncRNA map module 260-10 identifies and illustratesncRNA genes from the human genome, and their splicing and processinginto the mature functional RNA molecules in tabular, graphical, andsequence illustrations, and creates a repository for the non-coding RNAgenes platform containing all possible information for ncRNA genes in agenome such as exon details with the genomic position of the exons,transcript details, exon length, splicing and maturation processes, andconsequences of the mutations. In some embodiments, ncRNA map module260-10 identifies mutations in the non-coding RNA genes by modifying andapplying the Shapiro & Senapathy algorithm and other relevant algorithmsacross the gene and genomic scale from individual subjects and in acohort of subjects, and enabling the clinicians to correlate themutations in non-coding RNA genes that drive disease pathogenesis, andidentifies mutations in the regulatory elements of the non-coding RNAgenes responsible for disease-causing, adverse drug reactions andaffecting the efficacy of various drugs in a subject. In someembodiments, ncRNA map module 260-10 identifies known disease-causingmutations in different ncRNA genes, and using them to predict ordiagnose mutations and diseases from the subject genome, parses theidentified mutations in non-coding RNA genes against the curated GenomeExplorer proprietary mutation database, enabling to distinguish andcategorize the known and novel mutations of non-coding RNA genesreported in the individual and cohort subjects, and identifiesstructural and functional motifs and elements in the non-coding (nc) RNAgenes (rRNA, tRNA, miRNA, snRNA, snoRNA, siRNA, lncRNA).

In some embodiments, ncRNA map module 260-10 identifies disease-causing mutations, including known disease-causing mutations, in different ncRNA genes, and uses them to predict or diagnose mutations and diseases from the subject genome. In some embodiments, ncRNA map module 260-10 identifies sequence signals for processing different ncRNA genes to their mature forms using the modified Shapiro & Senapathy and other algorithms based on consensus, PWMs, and other relevant parameters for all ncRNA genes, and compares subject ncRNA gene sequences with reference sequences to identify mutations using modified Shapiro & Senapathy and other relevant algorithms based on the score difference between the normal and the mutated signals. In some embodiments, ncRNA map module 260-10 identifies subjects with frequently occurring mutations in the structural and functional motifs and elements in the non-coding (nc) RNA genes (rRNA, tRNA, miRNA, snRNA, snoRNA, siRNA, lncRNA), enables the visualization and analysis of variability within the ncRNA sequence positions to determine disease associations, defines the allowed (green) and non-allowed (red) regions of the positive-negative signature of an ncRNA gene from the alignment of various types of ncRNA genes, and determines the pathogenicity or deleteriousness of a variant by its occurrence in the positive (green) or negative (red) region of the ncRNA signature.
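As a minimal sketch of the positive-negative signature described above, the following example treats every base observed in an alignment column as allowed (positive, green) and every other base as non-allowed (negative, red). The toy alignment, the 0-based column indexing, and the function names are assumptions for illustration; an actual embodiment may weight or filter the alignment differently.

```python
def build_signature(alignment):
    """Return, per alignment column, the set of bases observed in that column."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    # Gap characters ('-') are ignored when collecting allowed bases.
    return [{row[i] for row in alignment if row[i] != "-"} for i in range(width)]

def classify_variant(signature, column, alt_base):
    """Place a variant in the positive (green) or negative (red) region."""
    return "positive" if alt_base in signature[column] else "negative"

# Hypothetical aligned ncRNA fragments from three organisms.
toy_alignment = ["GCAUUG", "GCGUUG", "GCAUCG"]
signature = build_signature(toy_alignment)

seen_elsewhere = classify_variant(signature, 2, "G")   # G occurs in column 2
never_observed = classify_variant(signature, 0, "A")   # only G occurs in column 0
```

Under this simple rule, a variant that reproduces a base tolerated in another organism falls in the positive region, while a base never seen at that column falls in the negative region.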

In some embodiments, ncRNA map module 260-10 displays non-redundantbases from the multiple sequence alignment of ncRNA genes from variousorganisms in one color (e.g., green), and all other bases in anothercolor (e.g., red), showing a map of allowed (positive) and non-allowed(negative) nucleotide substitution space across the sequence, indicatingvariants that may result in a viable or defective regulatory RNA. Insome embodiments, ncRNA map module 260-10 determines that thedeleterious mutations would fall within the negative region (red) andthat the benign or likely pathogenic mutations would fall within thepositive region (green), and applying this finding in testing anddetermining if a given variant is deleterious or not, and determineswhether the impact and clinical significance of the mutations isdeleterious or not, based on the occurrence of the altered base withinthe negative space or the positive space, thereby showing where theactual mutations occur by color codes. In some embodiments, ncRNA mapmodule 260-10 provides interactive visualizations and analyticalcapabilities for focusing on the mutations in various non-coding RNAgenes individually on gene structures and on a genomic scale,facilitating the ability to perform non-coding RNA gene analysis acrossindividual and multiple subjects and cohorts, which may be involved inthe regulation of gene expression, splicing, transcriptional andtranslational control, chromatin remodeling, and cell proliferation. 
In some embodiments, ncRNA map module 260-10 identifies new disease-causing mutations in non-coding RNA genes based on individual and cohort analysis, thereby providing a range of therapeutic targets and enabling and exploiting the development of RNA-based therapeutics, enables the search option for genes from various gene panels such as disease panels and other user-given gene panels, and enables toggle options for displaying the graphical illustrations of details of every ncRNA gene and plotting the mutations in an expanded view.

In some embodiments, ncRNA map module 260-10 identifies and analyzesexon splicing of an ncRNA gene, plotting the subject's mutation or anyknown mutations in the ncRNA genes from different databases such asdbSNP, ClinVar, and COSMIC, the effect of mutations such as suppressionof gene expression, splicing, and transcriptional regulation based onthe indigenous algorithm of ncRNA MAP, and analyzes subject mutationsoverlaid on the ncRNA genes to correlate their involvement in disease.In some embodiments, ncRNA map module 260-10 identifies mutations fromdatabases such as ClinVar, dbSNP, and COSMIC on the ncRNA genes tocorrelate their involvement in disease, enables the analysis of ncRNAfrom single subjects or cohort of subjects, enables the analysis ofncRNA mutations in various combinations of different regulatory elementsfrom single subjects or cohort of subjects in a single application, andidentifies the non-coding RNA sequences in one, multiple, or genes fromthe genomes of organisms including the human, other animals, plants, andeukaryotic microbial organisms. In some embodiments, ncRNA map module260-10 enables the user-guide of the platform such as the “About” thatautomatically provides context sensitive explanations for variousfeatures and applications, and “How To” that automatically providescontext sensitive information of how to use particular featuresthroughout the different sections of the platform, provides ncRNA mapstatistics genes in various organisms and displays various statisticsand information across the genome. In some embodiments, ncRNA map module260-10 enables tightly coupled navigation by interlinking differentsections to provide analysis of a gene, protein, or other elements andfeatures throughout the platform, and enables an expanded version foreach of the features of the ncRNA map, which allows users to visualizeand analyze further details in graphical, tabular, and sequenceillustrations.

Dark matter module 260-11 identifies protein genes within the dark matter genome using various algorithms such as Shapiro & Senapathy, Splice Atlas Splice Code, GenScan, Augustus, and GeneID, and identifies ncRNA genes within the dark matter genome using various algorithms. In some embodiments, dark matter module 260-11 identifies potential domains from the protein-coding genes of the dark matter genome using PfamScan and other algorithms, and applies each of modules 260 to newly found genes to integrate relevant data and information into database 252.

Application 222 may be installed by server 130 and perform scripts and other routines provided by server 130 to display graphic payload 225 provided by genome sequence analysis engine 242. In some embodiments, graphic payload 225 may include a mark for a mutation of the nucleotide string on a positive signature and a negative signature of a protein domain. In some embodiments, graphic payload 225 may include i-icons, mouse hovers with context-sensitive pop-ups for the user, pull-down menus, sliding windows and scales, active tabs and buttons, and other interactive elements that enable the user to retrieve more detailed information. In some embodiments, graphic payload 225 may enable toggle options for displaying the graphical illustrations of a selected gene or portion of the human genome, and plotting the corresponding mutations in an expanded view. Further, graphic payload 225 may enable a user-guide tab (e.g., including an “About” option that automatically provides context-sensitive explanations for various features and applications, and a “How To” option that automatically provides context-sensitive information on how to use and navigate particular features throughout the different sections of modules 260). Embodiments as disclosed herein enable the use of tightly coupled navigation features interlinking different sections and modules 260 to provide analysis of a selected gene, protein, or other elements and features throughout the platform.

FIGS. 3A-3F illustrate details of exon splices 300A, 300B, 300C, 300D, 300E, and 300F (hereinafter, collectively referred to as “exon splices 300”), according to embodiments disclosed herein. In some embodiments, exon splices 300 may be provided by an exon splice module interacting with a genome sequence analysis engine, as disclosed herein (e.g., exon splice module 260-1, and genome sequence analysis engine 242). Exon splices 300 may include an “after splicing view” to illustrate the consequence of excluding one or more exons that correspond to a protein domain during RNA splicing. In some embodiments, the “after splicing view” illustrates the disrupted protein product post-splicing, including the spliced exon structure and the protein product. The disrupted protein may undergo tolerated changes or destructive changes. Exon splices 300 indicate the protein changes on both the micro (nucleotide) and the macro (protein structure) scales, enabling disease and biological correlations.

In some embodiments, the coding exon-domain plot for the selected gene is visualized with exons in grey rectangles and domains overlaid on them. Domains are predicted using a search engine to search a protein sequence for the presence of domains encoded by the gene. The consequences of splicing out any of the exons or set of exons coding for a particular domain are predicted based on the Shapiro & Senapathy (S&S) algorithm and the codon degeneracy principle. After each exon is spliced out individually, the reading frame of the resultant ORF is checked for correctness. A shift in the frame of the ORF due to exon excision, the introduction of a premature termination codon (PTC), and the deletion of domain coding sequence (single domain or multiple domains) are combined to depict multiple possible consequences.
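The reading-frame check described above may be sketched, under simplifying assumptions, as follows. The example removes one coding exon at a time from a toy transcript and reports whether the removal shifts the frame, introduces a stop codon before the end of the spliced coding sequence, or removes an exon that encodes a domain. The exon sequences, the domain-to-exon mapping, and the rule that any stop codon in a shifted frame counts as premature are assumptions of this sketch, not features of the disclosed algorithm.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def skip_exon_consequence(exons, skipped, domain_exons=frozenset()):
    """Predict simple consequences of splicing out one coding exon.

    exons: coding-exon sequences whose concatenation is the reference ORF.
    skipped: 0-based index of the exon that is spliced out.
    domain_exons: indices of exons that encode (part of) a protein domain.
    """
    spliced = "".join(seq for i, seq in enumerate(exons) if i != skipped)
    # Removing a length that is not a multiple of three shifts the frame.
    frameshift = len(exons[skipped]) % 3 != 0
    codons = [spliced[i:i + 3] for i in range(0, len(spliced) - 2, 3)]
    stop_at = next((i for i, c in enumerate(codons) if c in STOP_CODONS), None)
    # Assumption: in a shifted frame any stop is premature; otherwise a stop
    # is premature only if it precedes the last codon.
    ptc = stop_at is not None and (frameshift or stop_at < len(codons) - 1)
    return {"frameshift": frameshift, "ptc": ptc,
            "domain_lost": skipped in domain_exons}

# Hypothetical five-exon ORF: ATG GCT GAA GTC CTG ACC GTT GCA TAA
toy_exons = ["ATGGCTG", "AAGT", "CCTGACC", "GTTGCA", "TAA"]
results = [skip_exon_consequence(toy_exons, i, domain_exons={2, 3})
           for i in range(1, 4)]
```

In this toy transcript, skipping the second exon shifts the frame and exposes a stop codon, while skipping the fourth exon (six bases) leaves the frame intact but removes domain-coding sequence.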

Exon splices 300 may be provided for genes from various panels fordifferent diseases, with user preferences accommodated for diseaseslist, gene names, and transcript identifiers. The most probable anddestructive splice events are indicated for the chosen diseasediagnosis. Exon splices 300 may indicate protein isoforms associatedwith a selected sequence of exons 310 codifying different proteindomains 320-1, 320-2, and 320-3 (hereinafter, collectively referred toas “protein domains 320”). Exons 310-1, 310-2, 310-3, 310-4, 310-5,310-6, 310-7, 310-8, and 310-9 (hereinafter, collectively referred to as“exons 310”) include portions of a nucleotide string 301 coding aminoacids in a protein having protein domains 320. Exon splices 300A, 300B,and 300C include after splicing views of exons 310. Exon splice 300Aincludes a hydropathy view of protein domains 320, and exon splice 300Dincludes a sequence view listing the nucleotide string of the exonchain. Exon splices 300 may be provided in a graphic payload of anapplication running in a client device and hosted by a genome sequenceanalysis engine in a server, as disclosed herein. For example, exonsplices 300 may include a display of an amino acid hydropathy chart 330,listing the hydrophobicity and/or hydrophilicity of each amino acid. Thehydro section aids in visualizing the signature of the selected Pfam IDbased on the values of their hydropathy index. The signature plot inthis section is color-coded based on the hydropathy index scale, wherehydrophobic amino acids are represented in shades of red and hydrophilicamino acids are shown in shades of blue. Accordingly, a hydropathypattern 333-1, 333-2, and 333-3 (hereinafter, collectively referred toas “hydropathy patterns 333”) may be displayed for each of the proteindomains 320. Hydropathy patterns 333 depict hydropathy index valuesdetermined by various methods along the amino acids sequence of theselected transcript. 
In some embodiments, hydropathy patterns 333 may be indicated using a sliding window. In some embodiments, hydropathy patterns 333 may display the amino acids in color codes based on the hydropathy nature of amino acids, and the exon splice module may enable a mouse hover on each of the amino acids in the graphic payload supporting exon splice 300C so the user may view the corresponding codon in the nucleotide string.

In some embodiments, a hydropathy score in hydropathy pattern 333 may be calculated by a moving average of several adjacent amino acids, and exon splices 300 may enable mouse hover on the amino acid sequence or plot to view the hydropathy values. In some embodiments, exon splices 300 illustrate a hydropathy pattern 333 as a pattern of hills and valleys, showing the balance between hydrophobic and hydrophilic amino acids encoded by exons 310. In some embodiments, hydropathy pattern 333 includes a disruptive hydropathy indicating an imbalance in the hydropathy nature of amino acids caused by mutation.
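For illustration, a moving-average hydropathy score of the kind described above may be computed as in the following sketch. The Kyte-Doolittle index is used here as one possible hydropathy scale; the choice of scale, the window size, and the function name are assumptions, since the description leaves the scoring method open.

```python
# Kyte-Doolittle hydropathy index (one of several published scales).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=3):
    """Average the hydropathy index over each run of `window` adjacent residues."""
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the protein length")
    values = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical five-residue peptide; positive values are hydrophobic "hills",
# negative values are hydrophilic "valleys".
profile = hydropathy_profile("AILKD", window=3)
```

A larger window smooths the pattern of hills and valleys; a window of one reproduces the per-residue index.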

FIG. 3A and FIG. 3B illustrate exon splices 300A and 300B including mutations and other protein effects such as amino acid maintained 350-1, amino acid change 350-2, frameshift 350-3, premature termination codon (PTC) 350-4, domain lost 350-5, domain disrupted 350-6, and domain encoded by an exon that also encodes a neighboring domain 350-7 (hereinafter, collectively referred to as “protein effects 350”) in the protein sequence. Exon splice 300B also includes a tab 342 to select a database source (cf. database 252, e.g., dbSNP, ClinVar, COSMIC), and a tab 344 for identifying a mutation impact (e.g., clinical significance = deleterious) for a selected mutation. The mutations curated for the selected gene can be depicted on the plot by configuring the mutations toggle. The mutation details fetched from the respective databases are displayed when the user hovers over a particular mutation. The clinical significance may include ontology information of a disease and a disease phenotype. In some embodiments, exon splice 300B may display a mutation details window 350B indicating details of a selected mutation such as mutation position, mutation source (e.g., database 252), mutation ID, exon number (where the mutation occurs), CDS position, codon change, amino acid change, and a score factor for the mutation calculated by a selected scoring algorithm. Exon splice 300B also provides the user with a tab 346 to select the scoring algorithm used to evaluate the mutation score factor (e.g., SIFT, PolyPhen, cf. algorithm 250). Exon splices 300A and 300B illustrate the locations of frameshifts, premature stop codons, and amino acid changes that lead to specific sequence alterations caused by exon exclusion, and provide graphical, tabular, and sequence views with pop-up boxes, mouse hovers, and context-sensitive explanations for the user.

Protein isoforms include different protein variants that arise due to the rearrangement of the intron-exon elements during transcription, splicing, and translation. These isoforms pave the way for proteins with different structure, function, and cellular properties from a given gene, and in turn, increase the diversity of human proteins. Exon splices 300 are the result of an algorithm (cf. algorithm 250, e.g., including a Shapiro & Senapathy formulation) that predicts whether inherent exon skipping events, arising through potentially viable or destructive alternative splicing events, maintain or destroy the open reading frame of a gene, and thus have the potential to produce a viable or defective protein.

The algorithm predicts the outcomes for multiple exon or domain-coding exon skipping events in a human gene and analyzes the downstream effect of these events on the reading frame of the gene and the translated protein. Mutations such as frameshifts 350-3, premature stop codons, and amino acid changes 350-2 that cause protein alterations are predicted by mapping with a reference human genome from a database (e.g., database 252, including a Pfam database) to locate protein domains 320. Hampered functionality is also predicted by diagnosing the domain-disabling mutations or exon skipping 355-1 or 355-2 that would result in a damaged protein (cf. FIGS. 3C-3D).

In some embodiments, the exon splice module indicates the consequences of a mutation in a protein and their molecular defectiveness (e.g., tab 344). For example, a defective protein may result from a defective gene due to a splice site mutation that leads to exon skipping 355-1. Thus, an exon splice module as disclosed herein may determine the consequence of splicing mutations in a subject, leading to splicing aberrations. In some embodiments, the exon splice module may interact with an alternative splice module (e.g., alternative splice module 260-4) to determine exons 310 and protein domains 320 that may include viable alternative splicing. Further, the exon splice module and alternative splice module may indicate which exons 310 and protein domains 320 would introduce unintended consequences and negatively affect protein functionality.

Exon splices 300 include exons 310 in grey blocks and domains 320 overlaid on them. In some embodiments, domains 320 may be predicted using an algorithm (e.g., PfamScan) to search a protein sequence for the presence of domains 320-4, 320-5, 320-6, 320-7, 320-8, and 320-9 (hereinafter, collectively referred to as “domains 320”) encoded by a selected gene. In some embodiments, the consequences of splicing out any of exons 310 coding for a particular domain 320 are predicted based on the Shapiro & Senapathy algorithm and a codon degeneracy principle. After each exon is spliced out individually, the resulting open reading frame (ORF) is checked for correctness. A shift in the ORF due to exon excision, the introduction of PTC 350-4, and the deletion of domain coding sequence 350-5 (single domain or multiple domains) are combined to depict multiple possible consequences.

FIG. 3C illustrates an after splicing view of exon splice 300C. In some embodiments, exon splice 300C may include a skipped exon 355-1 (e.g., exon 310-1 in exon splice 300A). Accordingly, nucleotide string 301C-1 starts in a position corresponding to exon 310-2. In some embodiments, skipped exon 355-1 may be the result of frameshift 350-3 combined with PTC 350-4. In some embodiments, exon splice 300C includes a skipped protein domain 355-2 (e.g., protein domain 320-2). In some embodiments, a nucleotide string 301C-2 may start at a different position marked in red (e.g., a stop codon) within exon 310-2. In some embodiments, a skipped protein domain 355-2 may be the result of overlapping domains, domain disruption 350-6, and domain skipping (or loss) 350-5, associated with a nucleotide string 301C-3.

FIG. 3D illustrates a view tab of exon splice 300D including a complete coding sequence, before splicing 301D-1 and after splicing 301D-2 (hereinafter, collectively referred to as “nucleotide strings 301D”), of the selected transcript. Exon splice 300D displays a “sequence view” of nucleotide strings 301D representing stop codons 341-1, 341-2, 341-3, 341-4, 341-5, 341-6, 341-7, and 341-8. In some embodiments, the exon splice module may also display a pathogenicity score for mutations 341, marked by an in-silico classifier. In some embodiments, the user may request a tabular illustration displaying information for selected protein domains 320 and their respective exons 310, such as encoding domain name, Pfam ID (domain identifier), Start/End within exons, Start/End within transcript, and the exons coding for the domain. Table I below shows an exemplary table providing a summary of some of mutations 341 and their consequences (cf. protein effects 350).

TABLE I

Original sequence    New sequence     Consequence
ATGTGCAATTCCTGA      ATGTATTCCTGA     AA change
ATGTGCAAGTCCTGA      ATGTAGTCCTGA     AA change + PTC
ATGTGCAATTCCTGA      ATGAATTCCTGA     AA maintained
ATGTGCACGTCCTGA      ATGTACGTCCTGA    Frameshift
ATGTGCAAGTCCTGA      ATGTAAGTCCTGA    Frameshift + PTC
ATGTGCACTGACTGA      ATGTACTGACTGA    Frameshift + PTC
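The consequences listed in Table I can be reproduced with a short sketch that translates both sequences with the standard genetic code and compares them. The classification rules used here are assumptions chosen to match the six rows of the table: a length difference that is not a multiple of three is a frameshift, a stop codon before the last codon (or any stop in a shifted frame) is a PTC, and the amino acids are regarded as maintained when the new translation can be obtained from the original purely by deleting residues. An actual embodiment may apply different or additional rules.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate complete codons from position 0, keeping stops as '*'."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def is_subsequence(small, big):
    """True if `small` can be obtained from `big` by deleting characters."""
    remaining = iter(big)
    return all(ch in remaining for ch in small)

def classify(original, new):
    """Label the consequence of replacing `original` with `new` (sketch rules)."""
    frameshift = (len(original) - len(new)) % 3 != 0
    new_protein = translate(new)
    stop_at = new_protein.find("*")
    ptc = stop_at != -1 and (frameshift or stop_at < len(new_protein) - 1)
    if frameshift:
        label = "Frameshift"
    elif is_subsequence(new_protein, translate(original)):
        label = "AA maintained"
    else:
        label = "AA change"
    return label + (" + PTC" if ptc else "")

# Rows of Table I: (original sequence, new sequence).
TABLE_I = [
    ("ATGTGCAATTCCTGA", "ATGTATTCCTGA"),
    ("ATGTGCAAGTCCTGA", "ATGTAGTCCTGA"),
    ("ATGTGCAATTCCTGA", "ATGAATTCCTGA"),
    ("ATGTGCACGTCCTGA", "ATGTACGTCCTGA"),
    ("ATGTGCAAGTCCTGA", "ATGTAAGTCCTGA"),
    ("ATGTGCACTGACTGA", "ATGTACTGACTGA"),
]
labels = [classify(orig, new) for orig, new in TABLE_I]
```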

Exon splices 300 aid in determining whether the alternative splicing of exons encoding a domain is genuine, based on whether such splicing leads to a PTC or frameshift in the protein sequence, or leaves the frame of the protein sequence unaltered, thus maintaining the downstream sequence. This approach can thus distinguish genuine alternative splicing events from spurious events that were incorrectly annotated due to methodological difficulties. Exon splices 300 thus identify the genuine and spurious alternative splicing occurring in all of the human genes and catalogue the genuine events as biological alternative splicing. This approach also enables identification of defective splicing due to various splice site mutations and of the defective proteins leading to diseases.

Using information collected by exon splices 300 for any selected gene or portion of the human genome, the exon splice module may create a repository containing exon details with the encoding domains, genomic position of the exons, transcript details, exon length, protein structure of the protein domain, amino acid sequence, hydropathy index for each of the amino acids, and consequences of the exon splicing and mutations. Accordingly, embodiments as disclosed herein enable the search option for genes from various gene panels such as disease panels, the drug metabolizing gene (DMG) panel, the American College of Medical Genetics and Genomics (ACMG) gene panel, and other user-given gene panels.

FIGS. 3E and 3F illustrate a repeating pattern of exons and a domain inthe gene MUC16 in an exon splice map, according to some embodiments.

FIG. 3F illustrates a pattern 370 with a consecutive repetition of fiveexons along with the domain encoded by three of the five exons.Embodiments as disclosed herein enable the identification of pattern370, which may have clinical relevance across multiple maps.

FIGS. 4A-4C illustrate details of cryptic splices 400A, 400B, and 400C (hereinafter, collectively referred to as “cryptic splices 400”), according to embodiments disclosed herein. Cryptic splices 400 may be provided by a cryptic splice module, as disclosed herein (cf. cryptic splice module 260-2). A cryptic splice site (CSS) is defined as a sequence of 15 bases for acceptors 412 a and 9 bases for donors 412 d (hereinafter, collectively referred to as CSS 412) that matches closely with the real splice sites but occurs in sequence regions other than the real sites, anywhere within a nucleotide string 401A. A cryptic exon 415 is defined as a sequence between a cryptic acceptor 412 a and a cryptic donor 412 d with at least one open reading frame (ORF) 417 between them. In some embodiments, CSS 412 is also formulated by modifying other relevant splice site prediction algorithms to predict CSS 412 for the different regulatory elements by using position weight matrix (PWM) methods, consensus sequences, and sequence lengths specific for the different elements, respectively. Different protein domains 420-1, 420-2, 420-3, and 420-4 (hereinafter, collectively referred to as “protein domains 420”) are also illustrated.

The user selects a gene based on the search criteria from the drop-down list enabled under each of the search options such as genes, clinical associations, number of cryptic sites, and cryptic exons. The cryptic sites and cryptic exons are depicted on the gene plot along with their scores, sequence, and other additional information, and are presented as the gene view, sequence view, and table view. CrypticSplice enables the user to modify the cryptic site score threshold and cryptic exon length criteria to analyze the CrypticSplice map of the selected gene. Real exons and splice sites are displayed in the selected transcript as shown in the key. Any cryptic splice sites and cryptic exons that occur within the transcript are also displayed. A cryptic splice site is defined as a sequence of 15 bases for acceptors and 9 bases for donors, which has a Shapiro-Senapathy algorithm score that is above the selected score threshold. A cryptic exon is defined as a sequence between a cryptic acceptor and a cryptic donor that falls within the selected length range.

FIG. 4A illustrates cryptic splice 400A, according to some embodiments.

FIG. 4B illustrates cryptic splice 400B including CSSs 412 that pass a selected score threshold and are detected and mapped onto the gene sequence. The scores of real splice sites (e.g., real acceptor 402 a and real donor 402 d, hereinafter collectively referred to as “real splice sites 402”), real exons 410, cryptic splice sites 412, and cryptic exons 415 are shown on a nucleotide string 401A, 401B, or 401C (hereinafter, collectively referred to as “nucleotide strings 401”), creating a landscape of real and cryptic splice scores across the gene. These scores can be used to predict where erroneous splicing may occur if a mutation weakens a real splice site or strengthens a cryptic splice site. They also indicate the occurrence of alternative splicing positions in the gene during biologically mediated alternative splicing, and alternative splicing aberrations due to mutations. The cryptic splice module, using cryptic splices 400, identifies hidden splice sites and exons in a selected gene, exposing likely locations that the splicing machinery may target under different biological and disease conditions.
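As a hedged illustration of how a score landscape may flag erroneous splicing, the following sketch compares the score of one site in a reference sequence and in a mutated sequence. The scoring function is a deliberately simple stand-in (percentage of positions matching a donor-like consensus) and is an assumption of this example; it is not the Shapiro & Senapathy score, and the sequences and function names are hypothetical.

```python
def consensus_score(window, consensus="CAGGTAAGT"):
    """Toy stand-in score: percent of positions matching a consensus string."""
    matches = sum(a == b for a, b in zip(window, consensus))
    return 100.0 * matches / len(consensus)

def site_score_change(ref_seq, alt_seq, site_start, width=9, score=consensus_score):
    """Return (reference score, mutant score, difference) for one site window."""
    ref = score(ref_seq[site_start:site_start + width])
    alt = score(alt_seq[site_start:site_start + width])
    return ref, alt, alt - ref

def interpret(delta, is_real_site, tol=1e-9):
    """Translate a score difference into a qualitative flag."""
    if is_real_site and delta < -tol:
        return "real site weakened"
    if not is_real_site and delta > tol:
        return "cryptic site strengthened"
    return "no adverse change predicted"

# Hypothetical real donor at position 2; the mutation changes G to A inside it.
reference = "AACAGGTAAGTCC"
mutant = "AACAGATAAGTCC"
ref_score, alt_score, delta = site_score_change(reference, mutant, site_start=2)
flag = interpret(delta, is_real_site=True)
```

Any scoring function with the same call signature could be passed through the `score` parameter, so the comparison logic is independent of the particular splice site model.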

Cryptic splice 400B enables a user-driven approach to identify and correlate the mutations in cryptic acceptor 412 a, cryptic donor 412 d, and cryptic exons from any subject exhibiting any cancer or non-cancer disorder. A fully coding exon 410 is delimited on its 5′ end by a 5′ partially coding exon 422-5 and on its 3′ end by a 3′ partially coding exon 422-3. In some embodiments, a non-coding exon 414 may also be indicated. A slide button 470 may enable the user to pan nucleotide string 401A in either direction (towards the 5′ and 3′ ends) for convenience. In some embodiments, cryptic splice 400B enables a pattern analysis of variations in cryptic splice sites and cryptic exons for different transcripts of a given gene, and across different genes. In some embodiments, cryptic splice 400B displays mutations 450B from one or more subjects within the real and cryptic splice sites and exons, determines the pathogenicity of these mutations by comparison with the scores obtained in the normal sequence, and displays the mutations within any of these features and genetic elements in color codes and in graphical, tabular, and sequence illustrations. In some embodiments, cryptic splice 400B identifies the consequences of mutation in these structures of the transcripts and the genes with graphical and sequence illustrations, plotting subject mutations in real or cryptic splice and exonic regions together with the known mutations from different databases such as dbSNP, ClinVar, and COSMIC, categorized by clinical significance, molecular consequence, variation type, and pathogenicity based on the SIFT and/or PolyPhen scores. In some embodiments, cryptic splice 400B overlays the subject(s) mutations on the gene and the genome, comparing them with the known mutations from the databases, by visual illustrations and analytical tools in graphical, tabular, and sequence views with pop-up boxes, mouse hovers, and context-sensitive explanations.
In some embodiments, cryptic splice 400B displays the exon-intron structure of a selected gene including the promoters, UTRs, and poly-A sites, and overlays the cryptic donor and acceptor sites and cryptic exons, as well as the subjects' variants and mutations across these features, on the entire gene in a graphical display of the gene.

FIG. 4C illustrates cryptic splice 400C, according to some embodiments.

In some embodiments, individual exons and introns, un-transcribed and un-translated regions of the selected transcript of a given gene are analyzed independently. These sequences are split into candidate acceptors (15 bases) and donors (9 bases) by several methods including sequence PWM methods using the Shapiro & Senapathy algorithm, and scores are calculated for each 15/9 mer (e.g., each acceptor/donor pair). The sites having a 15/9 mer score higher than the cut-off threshold score of 50 are considered cryptic sites. The cryptic splice module may use other methods such as sliding window methods, motif search methods, methods using motif sequence lookups, and sequence alignment methods to detect and analyze CSSs 412. Furthermore, Splice Atlas has discovered that different sequence lengths for donor and acceptor splice sites may be more optimal when compared with 15/9 bases, which are also used.
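The sliding-window scan described above may be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and `toy_donor_score` is a placeholder that merely rewards a GT dinucleotide. It does not reproduce the Shapiro & Senapathy position weight matrix. The window widths (15 and 9) and the cut-off of 50 are taken from the text.

```python
from typing import Callable, List, Tuple

ACCEPTOR_LEN = 15  # acceptor window width, per the text
DONOR_LEN = 9      # donor window width, per the text
CUTOFF = 50.0      # cut-off threshold score, per the text


def scan_sites(seq: str, width: int, score_fn: Callable[[str], float],
               cutoff: float = CUTOFF) -> List[Tuple[int, str, float]]:
    """Slide a window of `width` bases over `seq`; keep windows scoring above `cutoff`.

    Returns (1-based window start, window sequence, score) tuples.
    """
    hits = []
    for i in range(len(seq) - width + 1):
        kmer = seq[i:i + width]
        score = score_fn(kmer)
        if score > cutoff:
            hits.append((i + 1, kmer, score))
    return hits


def toy_donor_score(kmer: str) -> float:
    """Hypothetical stand-in scorer (NOT the Shapiro & Senapathy matrix).

    Scores 100 when bases 4-5 of the 9-mer are GT, otherwise 0.
    """
    return 100.0 if kmer[3:5] == "GT" else 0.0


donor_hits = scan_sites("AAGGTAACAT", DONOR_LEN, toy_donor_score)
```

In practice, `score_fn` would be replaced by whichever scoring method is selected (a PWM-based score or one of the other algorithms named in this disclosure), and the same scan would be run with `ACCEPTOR_LEN` for acceptor candidates.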

For each transcript, valid CSSs 412 are taken as study sequences. Cryptic exons 415 are formed from the last base of a cryptic acceptor 412 a to the first three bases of a cryptic donor 412 d. In some embodiments, a cryptic splice module as disclosed herein combines each cryptic acceptor site 412 a with each of the cryptic donor sites 412 d that occur within a chosen exon length limit. The scores are calculated for each exon possibility using a suitable algorithm (e.g., algorithm 250, including a Shapiro & Senapathy algorithm). In some embodiments, sequences having lengths between a minimum cryptic exon length 417 and a maximum cryptic exon length 419 (e.g., 50 and 500 bases) and a score higher than a minimum cut-off threshold score 425 (e.g., 50) are considered as cryptic exons 415. CSSs 412 also enable methodological variations of forming cryptic exons 415. In some embodiments, cryptic splices 400 may predict cryptic exons 415 based on the cryptic donor and acceptor splice site scores using equal or unequal weights for the donor and acceptor scores, and assigning a score for cryptic exons 415.
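The pairing of cryptic acceptors with cryptic donors may be sketched as below. The names `Site`, `CrypticExon`, and `pair_cryptic_exons` are hypothetical. The exon score is assumed here to be a weighted average of the two site scores, which is only one of the weighting schemes the text allows. The exon is taken to run from the last base of the 15-base acceptor window to the third base of the donor window, following the definition above.

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    pos: int      # 1-based start of the site window on the sequence
    score: float  # splice site score


class CrypticExon(NamedTuple):
    start: int
    end: int
    score: float


def pair_cryptic_exons(acceptors: List[Site], donors: List[Site],
                       min_len: int = 50, max_len: int = 500,
                       min_score: float = 50.0, w_acc: float = 0.5,
                       acc_width: int = 15) -> List[CrypticExon]:
    """Combine every acceptor with every donor that yields an exon of allowed length."""
    exons = []
    for a in acceptors:
        start = a.pos + acc_width - 1  # last base of the acceptor window
        for d in donors:
            end = d.pos + 2            # third base of the donor window
            length = end - start + 1
            if not (min_len <= length <= max_len):
                continue
            # Assumed scoring: weighted average of acceptor and donor scores.
            score = w_acc * a.score + (1.0 - w_acc) * d.score
            if score > min_score:
                exons.append(CrypticExon(start, end, score))
    return exons


candidates = pair_cryptic_exons(
    [Site(10, 80.0)],
    [Site(100, 70.0), Site(30, 90.0), Site(1000, 90.0)],
)
```

Setting `w_acc` to a value other than 0.5 gives the unequal-weight variant; the length limits and score threshold default to the example values 50, 500, and 50 given above.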

Cryptic splices 400 may identify and map a CSS 412 and a cryptic exon 415 in the human genome according to score threshold 425 within nucleotide strings 401. The scores of real splice sites 402, real exons 410, cryptic splice sites 412, and cryptic exons 415 are also shown, creating a landscape of real and cryptic splice scores across the gene. These scores can be used to predict where the spliceosome may accidentally turn if a mutation weakens a real splice site 402 or strengthens a cryptic splice site 412. They also suggest where the spliceosome may purposefully turn during biologically mediated alternative splicing. Cryptic splices 400 thus bring to light the hidden splice sites and exons in any gene, exposing the most likely locations that the splicing machinery will target under different biological conditions. Cryptic splices 400 also enable visualization and analysis of mutations from the sequence of a subject or cohort. In some embodiments, cryptic splices 400 may identify CSSs 412 in the genome of any organism including humans, animals, plants, bacteria, or fungi.

Cryptic splices 400 enable nested search boxes for the user to choose the genes having the highest number of cryptic sites, the highest number of cryptic exons, the highest cryptic site score, or the highest exon score, as well as disease associated genes and exceptional genes. Cryptic splices 400 create a repository containing information for genes in a genome such as exon details with the encoding domains, genomic position of the exons, transcript details, exon length, protein structure of the domain, and real and cryptic splice donors, acceptors, and exons, and enable the display and analysis of any gene by a query. Cryptic splices 400 enable a search option for genes based on various parameters of cryptic sites, cryptic exons, cryptic splice site scores, and cryptic exon scores, and based on various gene panels such as disease panels, drug metabolizing gene (DMG) panels, the American College of Medical Genetics and Genomics (ACMG) gene panels, and other user given gene panels. In some embodiments, a tab 461 may indicate the total number of cryptic sites (e.g., in a selected gene), and a tab 465 may indicate a total number of cryptic exons. A toggle switch 450A may turn on/off the illustration of mutations in the gene, as well.

In some embodiments, cryptic splices 400 create a landscape of real and cryptic splice scores across the entire gene by layering the scores of real splice sites, real exons, cryptic splice sites, and cryptic exons on the gene structure. In some embodiments, cryptic splices 400 predict the impact of a subject mutation on the action of the spliceosome, and use the splice site scores to determine where the spliceosome may err if a mutation weakens a real splice site or strengthens a cryptic splice site, or vice versa. In some embodiments, cryptic splices 400 predict where the spliceosome may purposefully turn during biologically mediated alternative splicing, identifying the hidden splice sites and exons in any gene, and exposing the most likely locations that the splicing machinery will target under different biological and disease conditions. In some embodiments, cryptic splices 400 enable the use of tightly coupled navigation by interlinking different maps to provide analysis of a gene, protein, or other elements and features throughout the platform. In some embodiments, cryptic splices 400 perform analysis of subject mutations overlaid on the cryptic splice site patterns on the genes to correlate their associations with disease. Mutations from databases such as ClinVar, dbSNP, and COSMIC on cryptic splice site patterns on the genes may be correlated with disease. In some embodiments, cryptic splices 400 identify real and cryptic splice sites throughout the gene sequence and the genes in the genome using other relevant algorithms such as MaxEntScan, NNSplice, and Human Splicing Finder, and apply them to subject and cohort genomics. In some embodiments, cryptic splices 400 discover different sequence lengths for donor and acceptor splice sites that are more optimal when compared with 15/9 bases. Applying these lengths detects splice sites and their cryptic versions.

FIG. 5A illustrates an exon chart 500, according to embodiments disclosed herein. Exon chart 500 may be provided by an exon chart module (cf. exon chart module 260-3), as disclosed herein. Exon chart 500 is a map of the exon length 520 within human genes, containing multiple details for exons 510-1, 510-2, 510-3, 510-4, 510-5, 510-6, 510-7, 510-8, 510-9, 510-10, 510-11, 510-12, 510-13, 510-14, 510-15, 510-16, 510-17, 510-18, 510-19, 510-20, 510-21, 510-22, 510-23, 510-24, 510-25, 510-26, and 510-27 (hereinafter, collectively referred to as “exons 510”). It displays the coding sequence (CDS) for each gene, and displays the lengths 520 of exons 510 and their associated splice scores in a graphical and sequence view. It additionally highlights any exon length repetition in each CDS, wherein multiple exons have the same length. Exon chart 500 also isolates exons 510 that have a highly outlying length compared to other exons 510 in a gene, and lists their splice scores as well as any cryptic splice sites contained within them. Exon chart 500 is thus a visual platform for analyzing the classification of exon lengths 520 and their accompanying splicing features, including unusual exon patterns in distinct genes. In some embodiments, exon chart 500 may provide further details for exons 510 (including the Shapiro & Senapathy score of the exon, acceptor and donor sites, and the exon and intron lengths) upon mouse hover by the user over the specific bar for a given exon 510.

Exon chart 500 maps exons in human genes as graphs of exon lengths 520 within each gene, creating visual bar charts of patterns such as unique exon length distribution, outlying exons, and length repetition. A detailed analysis of two additional features is also illustrated. The first is exon length repetition, wherein multiple exons in the gene have the same length. The second is highly outlying exon lengths, in which one or more exons in the gene may be much longer than the others (e.g., exons 510-10 and 510-11). Each repeated or outlying exon length 520 can be further examined with full nucleotide string views and associated splice site scores.

The cause and effect of these features in human genes, which have important functions in a large number of diseases, are yet to be understood. As exon length 520 and intron length are closely associated with the splice sites and their sequences, exon chart 500 enables the understanding of these unusual features for selected human genes. By consolidating the lengths and splice site sequences of exons from the human genes, exon chart 500 permits the detection of exons 510 with unusual repeat lengths, outlying exon length patterns, and splicing patterns in distinct genes, together with their biological implications, and the study of their associations with disease.

Exons 510 may be classified based on their coding features: 5′ non-coding sequences 514, 3′ non-coding sequences (not shown in FIG. 5), 5′ partially coding sequences 512, 3′ partially coding sequences 513, and fully coding sequences 511. The exons 510 present in a gene are characterized into multiple categories based on their length 520, to identify the exon length repetition property and the highest exon lengths 520 that signify the “outliers” in the gene (exons with >3 times the length of average exons, e.g., exons 510-10 and 510-11). Exception genes, which contain no stop codon, an in-frame stop codon, or selenocysteine codon sequences, are also identified.

The splice acceptor and donor scores for each of the exon-intron junction sites are calculated using an algorithm (e.g., algorithm 250, such as the Shapiro & Senapathy and other relevant algorithms) to depict the biological probability and impact of the splicing event occurring at these sites. Cryptic splice sites are also determined based on the Shapiro & Senapathy and other relevant algorithms, within the selected exon sequence, and their scores are tabulated. In addition, these real and cryptic splice sites are highlighted in the sequence view of the exons (cf. real splice sites 402 and CSSs 412).

The distribution of exon lengths in each of the transcripts of a gene is determined based on the CDS information, including the number of coding exons and the length of each exon in all the transcripts of a gene. The lengths of the exons are plotted against each of the exons in a transcript and their distribution is identified. Furthermore, the exon lengths that are repeated in a transcript are also identified and tabulated. Outlying exons are defined using various methods, including selecting the exons with length greater than thrice the average exon length.
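The two length-based features above (repetition and outliers) reduce to simple counting. The sketch below is illustrative only: the function names are hypothetical and the exon lengths are made-up example values, except for 4,932, which matches the outlying exon listed in Table II. It applies the "greater than thrice the average" rule stated in the text.

```python
from collections import Counter
from typing import Dict, List


def repeated_lengths(lengths: List[int]) -> Dict[int, int]:
    """Map each exon length that occurs more than once to its number of occurrences."""
    counts = Counter(lengths)
    return {length: n for length, n in counts.items() if n > 1}


def outlying_exons(lengths: List[int], factor: float = 3.0) -> List[int]:
    """Return 1-based exon numbers whose length exceeds `factor` times the mean length."""
    mean = sum(lengths) / len(lengths)
    return [i + 1 for i, length in enumerate(lengths) if length > factor * mean]


# Hypothetical coding-exon lengths for one transcript.
example_lengths = [100, 120, 100, 90, 4932, 110, 100]
repeats = repeated_lengths(example_lengths)
outliers = outlying_exons(example_lengths)
```

Because a single very long exon also raises the mean, this rule can only flag an outlier in transcripts with more than three exons; other outlier definitions mentioned in the text would behave differently.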

In addition, the exon chart module enables the user to search and query exon chart 500 by several criteria. CDS length: genes can be searched based on the coding sequence length, ranging from 1-110,000 bases, which directly reflects the gene length. Exon length repetition: genes that have exons of repetitive coding lengths can be chosen to determine the distribution of the exon lengths that are repeated. Outlying exon lengths: exon lengths that are significantly higher than the other exon lengths in a gene are termed “outlying” exon lengths, and genes with such stark differences are identified by incorporating a rule that the outliers should be >3 times the length of average exons. Gene: genes are defined based on the gene nomenclature, protein identifiers, clinical association, number of domains, and number of exons per domain, according to the user's preferences. Clinical association: the disease associations of somatic cancer, germline cancer, non-cancer inherited disorders, industrial panels, the ACMG gene panel, and the DMG panel are enabled in a dropdown list. Exception genes: genes that exhibit a rare characteristic exon behavior, such as containing an in-frame stop codon, a selenocysteine codon, or no stop codon at the end of the gene, are shown.

In some embodiments, exon chart 500 may also identify the biomarker mutations for exons 510 from various data sources such as dbSNP, ClinVar, and COSMIC (cf. database 252), reported for different diseases. Accordingly, exon chart 500 may determine and illustrate a probability to develop a disease, for a given subject.

The exon chart module may also provide tabulated information based on exon chart 500, as illustrated in Table II, below. For example, the exon chart module may identify exons 510 when an exon length is greater than or equal to three times an average exon length of a gene, based on exon chart 500. Accordingly, Table II indicates cryptic splice sites that occur within an outlying exon (e.g., exon 510-10 or 510-11). Table II displays the nucleotide string for the selected outlying exon; the real acceptor and donor, as well as the cryptic acceptors and donors, are depicted in different colors on the sequence view.

TABLE II

Exon Number   Exon length   Real Acceptor Score   Real Donor Score   Total Cryptic Acceptor Sites   Total Cryptic Donor Sites
Exon 11       4,932         88.16                 86.70              119                            34

Select cryptic score threshold: 70

Cryptic acceptor sites                  Cryptic donor sites
Position   Sequence          Score      Position   Sequence    Score
12         TCTTCTGAAAGA      72.58      232        AAGGACAGT   72.46
25         GAAGCTGTTCACAGA   70.68      470        AATGTCAGA   71.45
60         TTGTCCTTAACTAGC   74.15      487        AAGGTAACA   81.78
119        TTCTAATAATACAGT   78.06      549        GATGTATGT   74.04
130        CAGTAATCTCTCAGG   78.44      633        AAGGTACAA   71.57
146        TCTTGATTATAAAGA   76.08      889        CAGGTGATA   78.04
191        ATTTATTACCCCAGA   83.95      1,108      GAGGTAGCT   74.47
209        TGATTCTCTGTCATG   71.04      1,456      GAAGTCAGT   74.69

FIGS. 5B and 5C depict visualizations of an exon length distribution pattern 550 in the gene MUC16 in ExonChart Map.

FIG. 5B illustrates exon length distribution pattern 550 having a long tail 550C indicative of a repeated pattern. The tail 550C includes the repetitive patterns of exon lengths having a marginal size.

FIG. 5C illustrates the tail end 550C of distribution pattern 550 repeated in a specific fashion. In tail 550C, exons of length 36 are repeated 15 times, exons of length 66 are repeated 10 times, and so on. A block of 5 exons, with the lengths 173, 36, 66, 125, and 68, is repeated consecutively. It is to be noted that this gene MUC16 is an important cancer gene. This pattern is also connected with the repetition of a domain that is encoded by these exons as visualized in the Exon Splice map.
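A consecutive block repetition of this kind can be located with a short scan. The sketch below is illustrative: the function name is hypothetical and the example list of exon lengths is invented (it is not the actual MUC16 exon list). Only the five block lengths are taken from the text.

```python
from typing import List, Tuple


def count_block_repeats(lengths: List[int], block: List[int]) -> Tuple[int, int]:
    """Find the first occurrence of `block` and count its back-to-back repeats.

    Returns (0-based index of the first occurrence, number of consecutive copies),
    or (-1, 0) when the block does not occur.
    """
    k = len(block)
    for i in range(len(lengths) - k + 1):
        if lengths[i:i + k] == block:
            n = 0
            while lengths[i + n * k:i + (n + 1) * k] == block:
                n += 1
            return i, n
    return -1, 0


block = [173, 36, 66, 125, 68]         # block lengths from the text
example = [500] + block * 3 + [40]     # invented exon-length list for illustration
first_index, copies = count_block_repeats(example, block)
```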

FIGS. 6A-6D illustrate exemplary embodiments of alternative splices 600A, 600B, 600C, and 600D (hereinafter, collectively referred to as “alternative splices 600”), as disclosed herein. In some embodiments, alternative splices 600 are provided by an alternative splice module as disclosed herein (e.g., alternative splice module 260-4). Alternative splices 600 illustrate alternative transcripts of a nucleotide string 601A-1, 601A-2, 601A-3, 601B-1, 601B-2, 601B-3, 601B-4, 601B-5, 601B-6, 601B-7, and 601B-8 (hereinafter, collectively referred to as “transcripts 601A,” “transcripts 601B,” and “canonical transcripts 601”). Alternative splice 600A illustrates a canonical-based splice event, and alternative splice 600B illustrates an exon-based splice event.

Transcripts 601A may be identified by the length of a CDS (e.g., the longest, or one of the longer CDSs) in a gene (length summation of all coding exons). In some embodiments, transcripts 601A are identified by a length of the corresponding mRNA (e.g., the longest, or one of the longer mRNAs) in a gene (length summation of all mRNA exons, including coding and non-coding regions). In some embodiments, transcripts 601A may be identified by a number of mRNA exons (e.g., the highest, or one of the higher numbers) in the gene. In some embodiments, transcripts 601A may be identified by a number of coding exons (e.g., the highest, or one of the higher numbers) in the gene. When more than one transcript 601A has the same value under the selected method, a canonical transcript 601 may be selected when it is annotated as “canonical” in a third party database (e.g., database 252, such as the UniProt database). When the “canonical” status is not available in a third party database, the alternative splice module randomly assigns the “canonical” status to one of the two or more transcripts that satisfy the above criteria.
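The selection and tie-breaking rules above may be sketched as follows. The record layout (dictionaries with `id` and `cds_length` keys), the function name, and the `annotated` set standing in for a third party "canonical" annotation are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Set


def pick_canonical(transcripts: List[Dict], key: str = "cds_length",
                   annotated: Optional[Set[str]] = None,
                   rng: Optional[random.Random] = None) -> str:
    """Pick the transcript with the largest `key`; break ties by annotation, then at random."""
    best = max(t[key] for t in transcripts)
    tied = [t for t in transcripts if t[key] == best]
    if len(tied) == 1:
        return tied[0]["id"]
    # Tie: prefer a transcript flagged "canonical" by the (assumed) external annotation.
    flagged = [t for t in tied if t["id"] in (annotated or set())]
    if flagged:
        return flagged[0]["id"]
    # No annotation available: assign the canonical status at random.
    return (rng or random).choice(tied)["id"]


# Hypothetical transcripts of one gene.
example_transcripts = [
    {"id": "T1", "cds_length": 900},
    {"id": "T2", "cds_length": 1200},
    {"id": "T3", "cds_length": 1200},
]
chosen = pick_canonical(example_transcripts, annotated={"T3"})
```

Passing a different `key` (for example a hypothetical `mrna_length` or `exon_count` field) would give the other selection methods listed above.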

FIG. 6A illustrates alternative splice 600A including constitutive events 602-1 and 602-2 (hereinafter, collectively referred to as “constitutive events 602”), wherein the same exons are present as in the canonical transcript; and alternative acceptor and alternative donor events 604, wherein both the start and end exon positions of the present transcript are different from those of the constitutive exon. Alternative splice 600A may also include mRNA exons 606-1, 606-2, 606-3, 606-4, 606-5, 606-6, 606-7, 606-8, 606-9, and 606-10 (hereinafter, collectively referred to as “mRNA exons 606”); skipped events 608, wherein skipped exons are the constitutive exons that are not present in the transcript of interest; and cryptic events 610, wherein exons occur newly in the present transcript as compared to the constitutive exons. Alternative splice 600A may also include intron retention events 612, wherein an exon start position of the present transcript matches the start of one exon in the canonical transcript and the exon end position matches the end of another exon. The intronic region between these two exons is marked as intron retention event 612. Alternative splice 600A may also include an alternative donor event 614, wherein an exon start position of the present transcript is the same as that of a constitutive exon but the exon end position is different; and an alternative acceptor event 616, wherein the exon end position of the present transcript is the same as that of a constitutive exon but the exon start position is different.

FIG. 6B illustrates alternative splice 600B including an ‘Exon based’ selection, wherein the canonical transcript includes exons identified by number or count of exons across the transcripts for a given gene. Exons that occur in at least 50% of the total number of transcripts are identified as constitutive exons 602. When multiple constitutive exons 602 exist, an alternative splicing event is defined with respect to any exon that shares a start or end position with the exon in the selected transcript. In some embodiments, constitutive exon 602 is selected from exons that are present in more than 50% of the number of transcripts in a gene and are classified as constitutive exons. To illustrate the above, transcripts 601B illustrate constitutive exon 602A in the 5′ end of transcripts 601B-1, 601B-2, 601B-3, 601B-4, 601B-7, and 601B-8, followed by constitutive exon 602B in transcripts 601B-1, 601B-2, 601B-3, 601B-4, 601B-6, and 601B-8. An alternative donor and acceptor event 604 occurs when both the donor and acceptor splice sites of an alternative exon are different from those of the constitutive exon (e.g., alternative donor and acceptor event 604-6 in transcript 601B-6). Accordingly, the alternative exon is assigned both alternative donor and alternative acceptor sites. A skipped exon event 608 occurs when any of the constitutive exons 602 are missing in a transcript, and these missing exons are classified as skipped exons in that transcript. To illustrate this, skipped exon events 608A indicate the skipping of constitutive exon 602A in transcripts 601B-5 and 601B-6; skipped exon events 608B indicate the skipping of constitutive exon 602B in transcripts 601B-5 and 601B-7; and skipped exon event 608C indicates the skipping of alternative exon 610E in transcript 601B-4.

A cryptic exon event 610 occurs when an exon is found in less than 50% of the transcripts in a gene; such an exon is classified as a cryptic exon. To illustrate the above, alternative exon 610A appears in transcripts 601B-1 and 601B-5, only; alternative exon 610B appears in transcripts 601B-1 and 601B-7; alternative exon 610C appears in transcripts 601B-1, 601B-2, 601B-4, and 601B-5; alternative exon 610D appears in transcript 601B-1; alternative exon 610E appears in transcripts 601B-1, 601B-2, 601B-6, and 601B-7; and alternative exon 610F appears in transcripts 601B-2, 601B-3, and 601B-8. An alternative donor event 614 occurs when an exon has a donor splice site different from that of constitutive exon 602, and that different splice site is classified as an alternative donor site (as indicated by alternative exon 614-7 in transcript 601B-7). An alternative acceptor event 616 occurs when an exon has an acceptor splice site different from that of constitutive exon 602, and that acceptor splice site is classified as an alternative acceptor site (as indicated by alternative exon 616-8 in transcript 601B-8).
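The exon-based classification may be sketched as follows, under simplifying assumptions: exons are plus-strand (start, end) coordinate pairs, so the start is the acceptor side and the end is the donor side; the function names and example coordinates are hypothetical; and the combined alternative acceptor and donor event 604 is not modeled. An exon is treated as constitutive when it appears in at least 50% of the transcripts.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, plus strand assumed


def constitutive_exons(transcripts: Dict[str, List[Exon]],
                       min_fraction: float = 0.5) -> Set[Exon]:
    """Exons present in at least `min_fraction` of the transcripts."""
    counts = Counter(exon for exons in transcripts.values() for exon in set(exons))
    need = min_fraction * len(transcripts)
    return {exon for exon, n in counts.items() if n >= need}


def classify(exons: List[Exon], constitutive: Set[Exon]) -> Dict[Exon, str]:
    """Label each exon of one transcript, plus any constitutive exon it skips."""
    starts = {s for s, _ in constitutive}
    ends = {e for _, e in constitutive}
    events: Dict[Exon, str] = {}
    for exon in exons:
        s, e = exon
        if exon in constitutive:
            events[exon] = "constitutive"
        elif s in starts and e in ends:
            events[exon] = "intron retention"  # spans two constitutive exons
        elif s in starts:
            events[exon] = "alt donor"         # same start, different end
        elif e in ends:
            events[exon] = "alt acceptor"      # same end, different start
        else:
            events[exon] = "cryptic"
    tx_starts = {s for s, _ in exons}
    tx_ends = {e for _, e in exons}
    for s, e in constitutive:
        if s not in tx_starts and e not in tx_ends:
            events[(s, e)] = "skipped"
    return events


# Hypothetical gene with four transcripts.
gene = {
    "T1": [(1, 100), (200, 300), (400, 500)],
    "T2": [(1, 100), (200, 350), (400, 500)],
    "T3": [(1, 100), (200, 300), (600, 700)],
    "T4": [(1, 100), (150, 300)],
}
core = constitutive_exons(gene)
t2_events = classify(gene["T2"], core)
```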

FIG. 6C and FIG. 6D illustrate alternative splices 600C and 600D in alternative splicing of isoforms of the gene TP53, showing the types of exons such as constitutive, cryptic, and altered acceptor and/or donor in different color codes.

In some embodiments, alternative splices 600 may display the exons of canonical transcripts 601. In some embodiments, alternative splices 600 may display the exons of currently selected transcripts. In some embodiments, alternative splices 600 display coding exons of a current transcript. In some embodiments, alternative splices 600 display available domains for a particular transcript. In some embodiments, alternative splices 600 display splice events of exons in the current transcript by comparing with canonical transcripts 601. In some embodiments, alternative splices 600 also illustrate mutations from individual subjects and from cohorts of subjects for a selected gene. This visualization aids in the analysis of their disease associations. In addition, domain analysis sections displaying the domains coded by the exons of alternatively spliced events are also enabled.

Based on the various search options including genes, number of transcripts, splice events, and clinical associations, the alternative splice view of the selected gene is visualized. The constitutive exons can be selected based on two methods (‘Canonical based’ and ‘Exon based’). In the ‘Canonical based’ selection, the “longest CDS” option displays the canonical transcript, showing the coding exons, non-coding exons, pre-spliced domains, and the alternative splicing events relative to the canonical transcript. In the ‘Exon based’ selection, the alternatively spliced exons are classified and shown as constitutive, cryptic, alt donor, alt acceptor, alt acceptor+donor, exon skipping, and intron retention events for the selected gene and transcript.

FIGS. 7A-7E illustrate exemplary embodiments of exon frames 700A, 700B, 700C, 700D, and 700E (hereinafter, collectively referred to as “exon frames 700”), as disclosed herein. Exon frames 700 may be provided by an exon frame module as disclosed herein (cf. exon frame module 260-5). Exon frames 700 map coding exons in a gene and designate their reading frame in a transcript, indicated as 711-1, 711-2, and 711-3 (hereinafter, collectively referred to as “reading frames 711”). Exon frames 700 include a picture of exons 710-1, 710-2, 710-3, 710-4, and 710-5 (hereinafter, collectively referred to as “exons 710”) in reading frames 711. In some embodiments, coding exons 710 are placed in the reading frame in which they occur before RNA splicing. Exon frames 700 include an image of the entire split gene, with exons 710, introns, and stop codons 712-1 (TAA), 712-2 (TGA), and 712-3 (TAG) (hereinafter, collectively referred to as “stop codons 712”) that occur within each frame. In some embodiments, stop codons 712 are scanned in a sliding window method against the nucleotide strings and placed in the respective reading frames. Exon frames 700 streamline the detection of atypical gene patterns, such as long exons (cf. exon 710-5), long open reading frames without annotated exons, or short introns. In some embodiments, exons 710 are displayed in a single reading frame of the gene along with their splice sites and scores.

Based on the available search criteria, a transcript for the selected gene is displayed with reading frames 711 before and after the splicing process.

FIG. 7A illustrates exon frame 700A, which presents a transcript in reading frames 711 with stop codons 712, and coding exons 710, before splicing.

FIG. 7B illustrates exon frame 700B, which presents the transcript in reading frames 711 with stop codons 712, and the longest CDS 718 (spliced exon), after splicing.

FIG. 7C illustrates exon frame 700C, which displays the distribution of stop codons 712 in a randomly generated nucleotide string. In some embodiments, stop codons 712 are marked in different colors. The exons are plotted in a different color (e.g., as rectangles 718).

The reading frame, exon number, position, length, and several other details 752 are displayed upon mouse hover over the coding, non-coding, or partially coding exons 710. The selected gene is also represented in a single reading frame with exons and stop codons in the same pattern of “Before splicing” and “After splicing” views (e.g., exon frames 700A and 700B, respectively). Exon frames 700A, 700B, and 700C may offer one or more graphic interface features such as a toggle 742 to select the display of all the exon lengths, a toggle 744 to expand the exon display, and a selection tab 746 to include either exons 710, stop codons 712, or both in the display.

FIG. 7D illustrates exon frame 700D including, for each gene, a unique “ExCode” 710D, which portrays exon lengths as lines in a bar, a code 711D portraying reading frame lengths, or a code 751D which portrays mRNA length. Codes 710D, 711D, and 751D uniquely identify a gene, akin to a barcode identifier. Exon frames 700 thus enable a clear view into the special features of reading frames 711, exons 710, introns, and their correlations that exemplify eukaryotic split genes.

FIG. 7E illustrates exon frame 700E, which shows the distribution of ORF lengths in a randomly generated sequence of the same length as the gene, so that their frequency can be compared with the ORF length distribution in the real sequence. An amchart representing an overlapping curve is shown with the distribution of the length of ORFs in the random sequence and in the real sequence. Frequency is plotted on the Y-axis and the length of ORFs on the X-axis. The length of an ORF and its frequency are reported upon mouse hover.

Nucleotide strings 701A, 701B, and 701C (hereinafter, collectively referred to as “nucleotide strings 701”) are scanned for the various stop codons 712, which are labeled in each reading frame 711 (e.g., indicated by color, such as red, blue, green, and the like). The coding exons are spliced together, which is a form of RNA processing. Exon frame 700B displays the longest CDS 718 that occurs after splicing in the single reading frame of the gene along with its splice sites and scores. Exon frames 700 also illustrate, in a randomly generated nucleotide string, detection of any long introns and exons, short introns, unusual distribution of stop codons, and long open reading frames. In some embodiments, exon frames 700 also enable a clear view of the sequence features that exemplify eukaryotic split genes.

In some embodiments, reading frames 711 are computed and plotted by dividing the length of exons as follows: i) RF (Reading Frame) = ((Exon Start on gene − 1) % 3); or ii) RF = (Exon Start on gene − Previous Exon End on gene + Reading Frame of Previous Exon + 1) % 3; iii) if the calculated result is 0, the exon is placed at reading frame 711-1; iv) if the calculated result is 1, the exon is placed at reading frame 711-2; v) if the calculated result is 2, the exon is placed at reading frame 711-3.
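The two alternative formulas above can be transcribed directly. The function and constant names below are hypothetical; positions are assumed to be 1-based coordinates on the gene, and the code simply evaluates each formula as written, without implying that the two always give the same result.

```python
# Mapping from the computed remainder to the reading frame label used in the figures.
FRAME_LABELS = {0: "711-1", 1: "711-2", 2: "711-3"}


def reading_frame_from_start(exon_start: int) -> int:
    """Formula (i): RF = (Exon Start on gene - 1) % 3."""
    return (exon_start - 1) % 3


def reading_frame_from_previous(exon_start: int, prev_exon_end: int,
                                prev_frame: int) -> int:
    """Formula (ii): RF = (Exon Start - Previous Exon End + previous RF + 1) % 3."""
    return (exon_start - prev_exon_end + prev_frame + 1) % 3


# An exon starting at gene position 5 gives remainder 1, i.e., reading frame 711-2.
label = FRAME_LABELS[reading_frame_from_start(5)]
```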

The length of the spliced exons (e.g., CDS 718) and the spliced string are calculated as follows: i) Spliced Length = (First Exon Start − 1) + (Sum of the Length of all exons) + (Gene End − Last Exon End); ii) Spliced String = concatenation of the sequence before the first exon, all exon sequences, and the sequence after the last exon.
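A minimal sketch of these two calculations follows, assuming exons are given as sorted, non-overlapping, 1-based inclusive (start, end) pairs and that the gene end equals the length of the gene string. The function name and the toy sequence are hypothetical.

```python
from typing import List, Tuple


def splice(gene_seq: str, exons: List[Tuple[int, int]]) -> Tuple[str, int]:
    """Return (spliced string, spliced length) following the two formulas above."""
    first_start = exons[0][0]
    last_end = exons[-1][1]
    spliced = (
        gene_seq[:first_start - 1]                            # sequence before first exon
        + "".join(gene_seq[s - 1:e] for s, e in exons)        # all exon sequences
        + gene_seq[last_end:]                                 # sequence after last exon
    )
    length = (
        (first_start - 1)
        + sum(e - s + 1 for s, e in exons)
        + (len(gene_seq) - last_end)
    )
    return spliced, length


# Toy gene: 2 flanking bases, a 4-base exon, a 5-base intron, a 5-base exon, 2 flanking bases.
toy_gene = "NNATGCXXXXXGGTAAYY"
spliced_string, spliced_length = splice(toy_gene, [(3, 6), (12, 16)])
```

The computed length always equals the length of the concatenated string, which gives a quick consistency check on the exon coordinates.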

Stop codons 712 are scanned against the spliced string using a sliding window and plotted in reading frames 711. The reading frame for the spliced exons (e.g., CDS 718) is calculated as: ((First Exon Start − 1) % 3). In addition, a random nucleotide string with the same length as the gene is generated.
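The stop codon scan and the random comparison string may be sketched as below. The function names are hypothetical. Two assumptions are made for illustration: an ORF length is counted as the number of bases strictly between two consecutive in-frame stop codons, and the random string draws the four bases uniformly (the text does not state the base composition).

```python
import random
from typing import List, Optional

STOP_CODONS = {"TAA", "TGA", "TAG"}


def stop_positions(seq: str, frame: int) -> List[int]:
    """0-based start indices of stop codons read in the given frame (0, 1, or 2)."""
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]


def orf_lengths(seq: str) -> List[int]:
    """Lengths of the stretches between consecutive stop codons, over all three frames."""
    lengths = []
    for frame in range(3):
        stops = stop_positions(seq, frame)
        # Assumed definition: bases between the end of one stop and the start of the next.
        lengths += [b - a - 3 for a, b in zip(stops, stops[1:])]
    return lengths


def random_sequence(n: int, rng: Optional[random.Random] = None) -> str:
    """Uniform random nucleotide string of length n (uniform composition is an assumption)."""
    rng = rng or random.Random()
    return "".join(rng.choice("ACGT") for _ in range(n))


real = "TAAATGCCCTGA"
real_orfs = orf_lengths(real)
shuffled_orfs = orf_lengths(random_sequence(len(real), random.Random(1)))
```

Comparing the two lists of lengths (for example as frequency histograms) corresponds to the real-versus-random ORF length comparison described for exon frame 700E.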

In some embodiments, the lengths of exons, ORFs 711, and mRNA are analyzed in ExCode. The lengths of all exons, ORFs, and mRNA are marked on a graphical illustration for all the selected transcripts. Stop codons 712 and ORFs that are available are marked for reading frames 711. The ORF lengths and exon lengths are compared, creating exon identifier 710D for the transcripts. The distance between two stop codons 712 is thus analyzed, and other possible stop codons are mapped with the expectation that they do not fall inside coding exons 710.

Exon frame 700E illustrates a distribution of ORF lengths for a given gene. The original gene sequence and the randomly generated gene sequence are sourced and ORFs are plotted as real ORF lengths 721E-1, random ORF length 721E-2 (hereinafter, collectively referred to as “ORF lengths 721E”), or mRNA length 751E. The ORFs are identified as sequences between two consecutive stop codons and their lengths 721E are determined accordingly. ORF lengths 721E and their frequencies are collected in reading frames 711 and plotted to analyze the distribution of ORF lengths 721E. In some embodiments, tabular information may be provided to the user by a mouse hover over exon frames 700 (cf. Table III, below).

TABLE III

Exon Number   Acceptor 3′ splice signal   Donor 5′ splice signal
1             NA                          CAG|GTGAGC
2             CTCTTCTTTTTCAG|A            GTG|GTAAGC
3             TTCTTGCTCTTCAG|G            AAG|GTAGGC
4             TCTTGTCCCCGCAG|C            AAG|GTACGT
5             TTCCCTTCCCACAG|G            AAG|GTATTT
6             TTAATCTTTTACAG|A            CAG|GTAAAG
7             CTTTTGGTTTTCAG|G            GAG|GTACTG
8             CATTCTAATCTAG|G             CAG|GTACGT
9             TCTATGAAAGCAG|G             CAG|GTGAAA
10            ATCATTCTTTGCAG|A            CGA|GTAAGT

In Table III, stop codons that occur at the −3 position of the acceptor and the +2 position of the donor are highlighted (e.g., in ‘red’).

In some embodiments, exon frames 700 may include nested dropdown lists to select different parameters to display information and details such as various transcripts of a selected gene sourced from a third party database (e.g., NCBI database). The user can study exon frames 700 for genes with varying length of the coding sequence ranging from 1-110,000 bases in the list, or more. For each range, exon frames 700 provide the corresponding transcripts. In some embodiments, exon frames 700 include a dropdown list to enable genes having from 1-400 exons, or more, from which the user can select genes and study the pattern of exon reading frames and splicing events. For each range, the corresponding transcripts are also provided. In some embodiments, exon frames 700 may enable selecting genes according to their length, e.g., ranging from 1-3,000,000 bases, or more. Upon selecting the range of length, the corresponding genes are listed for which the user can study the patterns of exon frames and splicing events.

In some embodiments, exon frames 700 enable the user to select genes based on a clinical association. Accordingly, the user may select genes from panels for different disease categories such as Somatic cancers, Germline cancers, Non-cancer Inherited disorders, Industrial panels, ACMG, and DMG panels, to visualize the exon frame for the selected genes.

In some embodiments, exon frames 700 enable the user to select exception genes that: (i) contain an in-frame stop codon, i.e., genes having stop codons (TAA, TGA, TAG) inside the reading frame; (ii) contain a selenocysteine, i.e., an unusual stop codon (mostly TGA) in the coding sequence; or (iii) contain no stop codons 712.

FIGS. 8A-8D illustrate exemplary embodiments with protein charts 800A, 800B, 800C, and 800D of a protein signature (hereinafter, collectively referred to as “protein charts 800”), according to embodiments disclosed herein. Protein charts 800 are associated with amino acid strings 801A, 801B, and 801C (hereinafter, collectively referred to as “amino acid strings 801”).

Cryptic splice sites within the domain coding exons are determined based on the S&S and other relevant algorithms and plotted on the exons above the domain signatures. Cryptic splice sites are shown, based on the selected score threshold, as red boxes for acceptors and green boxes for donors, and their corresponding sequences are highlighted in the codon sequence below the exons. Alternative splicing events are determined by comparing the exons from the canonical transcripts in two different ways. 1) Exons that are present in 50% or more of transcripts are defined as constitutive, and alternative splicing events in the signature are shown with respect to these exons. 2) The transcript with the highest number of exons is defined as canonical, and alternative splicing events in the signature are shown with respect to this transcript. Any alternative splicing events that occur based on the occurrence of exons or relative to the canonical transcript are highlighted in these exons. Any skipped exons (or portions of exons) in the selected transcript are shown in black, and any added exons (or portions of exons) are shown in blue. The AA signature for any skipped or added exon region is shown below the corresponding exon positions.

The splice sites identified in the coding exons are predicted by employing the Shapiro-Senapathy and other relevant algorithms. Based on the variable threshold score range (e.g., 50 to 100) chosen, the splice sites having scores within the selected range are visualized on the plot. The cryptic and real splice sites are depicted in various color codes and overlain on the CDS plot and the signatures as well. The domain coding regions of the coding exons are aligned to the human domain sequence above the domain signature, and the cryptic splice sites falling within this region are displayed along with their scores and splice sequences.

Genes can be searched based on various search criteria. Based on the selection criteria, the gene and transcript information are displayed. Information such as gene name, chromosome number, gene ID, strand, protein ID, protein length, and number of exons is displayed, along with details on gene ontology and phenotype, on clicking the “Gene Info” button available in the information strip. ProtSig is divided into three different sections: Protein overview, Cryptic splice sites, and Variant density.

A protein overview section visualizes the coding exon of the selected gene along with the domain information overlaid as colored lines. By default, the compact view of the module is displayed. The expanded view can be displayed by switching the expanded view toggle “ON” at the top of the plot. The mutations curated for the selected gene can be depicted on the plot by configuring the mutations toggle. The databases used in curating the mutations for the selected gene are: dbSNP, ClinVar, and COSMIC. The mutation details fetched from the respective databases are displayed on hover of a particular mutation. The mutations from a patient exhibiting a disease can also be overlaid on any of the protein signatures and on the Positive-Negative protein signatures.

The cryptic splice section aids in visualizing exons that encode the domain, along with their codon and AA sequences. The cryptic splice sites within these domain coding exons are determined using the S&S and other relevant algorithms and are marked on the exons based on the score threshold selected from the dropdown. The cryptic acceptors are represented in red and the cryptic donors are shown in green. The score and sequence of the cryptic splice sites are displayed on mouse hover.

The alternative splicing signature depicts the signature of the domain region that is skipped or added during the alternative splicing process. Alternative splicing events that occur relative to the canonical transcript are highlighted in these exons. Skipped exons (or portions of exons) in the selected transcript are shown in black, and any added exons (or portions of exons) are shown in blue. The AA signature for a skipped or added exon region is shown below the corresponding exon positions.

Protein charts 800 aid in visualizing the data from both the seed and full alignment for a selected domain ID. They visualize the number of non-redundant AAs produced from the multiple sequence alignment in each position in the transcript. The number of amino acids and the domain position are displayed on mouse hover of the peaks in the signature plot.

The cryptic splice sites section aids in visualizing the coding exon of the selected gene overlaid with splice sites based on the selected threshold score. The different types of splice sites are color coded: cryptic acceptors in red, cryptic donors in green, and real sites in blue. The scores calculated by employing the S&S algorithm are depicted above each site for donors and below each site for acceptors. Site details such as start position, end position, sequence, and score can be displayed by hovering over the marking on the coding exons.

Protein charts 800 allow visualizing whether the altered amino acids fall within the set of allowed amino acids or its counterpart. The unique set of amino acids for each position of the domain from the seed or full alignment files is depicted as stacks in the green region, whereas the amino acids other than the allowed set are depicted in the red region, showing the allowed and non-allowed amino acids. It is thought that if the altered amino acids fall within the allowed set, the function of the domain is not affected. However, the domain's function is substantially disturbed when the altered amino acid is not accounted for in the allowed set.

FIG. 8A illustrates protein chart 800A, including multiple sequence alignments of amino acid strings 801A for a protein coded in diverse genomes having identifiers 811. For each sequence position, a set of “allowed” amino acids is generated in ProtSig (using an algorithm described below), creating a signature of potential amino acid substitutions across the domain. These signatures are color-coded based on multiple distinct parameters, such as the degree to which the amino acids are hydrophobic or hydrophilic, and whether they correspond to a region that is alternatively spliced. For example, a Glycine AA may be indicated by code 821, and a Proline AA may be indicated by code 823. A code 825 may indicate a small or hydrophobic AA (e.g., C, A, V, L, I, M, F, W), a code 827 may indicate Hydroxyl or amine amino acid groups (e.g., S, T, N, Q), a code 829 may indicate charged amino acids (e.g., D, E, R, K), and a code 831 may indicate a Histidine or Tyrosine amino acid (e.g., H, Y). For every unique identifier 811, an alignment section 801A is available in the seed/full file. Alignment sections 801A are parsed to identify the unique amino acids at each sequence position along with the gaps, and these are considered as amino acid stacks. A signature for the selected protein domain is created by these stacks present in each position of the sequence alignment in strings 801A. The alignment section including all strings 801A is parsed such that at each position, the unique AAs from the alignment are taken, including “.” (gaps), and are considered as stacks. For example, the stack for the ninth position is “.RKEY” (e.g., as can be verified by reading the ninth letter down all amino acid strings 801A, with many gene variants missing). The stacks in each of the positions in the alignment of a list of identifiers 811 form the signature of the domain.

FIG. 8B includes protein chart 800B, including a signature impression of the amino acid substitutions that likely maintain the structure and function of a given protein region, which helps bridge the divide between protein structure, function, splicing, mutations, and disease. Accordingly, the protein signature module converts the alignments in protein chart 800A into a signature in protein chart 800B by identifying the variable amino acids and avoiding the redundant amino acids at each position in amino acid string 801B. The signature in protein chart 800B is represented in graphical form as stacks of AAs for each position. The stacks whose positions had gaps (“.”) in more than 50% of the total number of sequences in the alignment are shown in grey boxes. A selection tab 830 allows the user to switch between seed and full alignment.
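
A minimal sketch of how the per-position stacks and grey regions described above might be derived from an alignment is given below. The function name `build_signature`, the dictionary layout of the result, and the alphabetical ordering of residues within a stack are assumptions made for illustration; the 50% gap fraction is the example value given above.

```python
def build_signature(aligned, gap_chars=".-", grey_fraction=0.5):
    """Collapse aligned sequences into one stack of unique residues per position.

    A stack starts with "." when at least one sequence has a gap at that
    position, and is flagged grey when gaps exceed grey_fraction of sequences.
    """
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned sequences must have the same length")
    signature = []
    for pos in range(width):
        column = [row[pos].upper() for row in aligned]
        gaps = sum(1 for c in column if c in gap_chars)
        residues = sorted({c for c in column if c not in gap_chars})
        signature.append({
            "position": pos + 1,  # 1-based domain position
            "stack": ("." if gaps else "") + "".join(residues),
            "grey": gaps / len(column) > grey_fraction,
        })
    return signature
```

For a toy alignment of four sequences, `build_signature(["MK.R", "MR.K", "ME.E", "M.AY"])` produces the stacks `M`, `.EKR`, `.A`, and `EKRY`, with only the third position flagged grey.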

From the alignment, the human domain sequence alone is taken as such and represented in blue boxes 805 above the signature plot. This also includes gaps 815. The secondary structure information of the amino acids 807 is also provided in the signature of protein chart 800B. The number of sequences considered in the alignment 803 is also provided, along with the number of gaps and the number of amino acids in strings 801B other than gaps, in each position of the domain signature in the signature plot of protein chart 800B. In some embodiments, protein chart 800B visualizes the number of non-redundant AAs produced from the multiple sequence alignment in each position in the transcript. The number of amino acids and the domain position are displayed on mouse hover of the peaks in the signature plot.

In some embodiments, a variable amino acid 810 v is displayed in proteinchart 800B when it occurs at least at greater than a specific fraction(e.g., 50%) of the aligned positions in protein chart 800A. Proteinsignature module identifies different amino acids at each position andincludes them as the variable or allowed amino acid 810 v at thatposition. The set of variable amino acids 810 v that does not alter theprotein functionality is referred to as an “allowable set.” Accordingly,for each position, any one of the 21 different available amino acidsthat are not in the allowable set belong to a “non-allowable set.” Thespecific modality for presentation of the chart may be selected by theuser via a select signature tab 810. For example, protein chart 800Bindicates a protein signature indicating stacks of allowed amino acidsrepresented in 20 colors (one color for each amino acid). Typically,when a mutation replaces an amino acid in the allowable set with anamino acid in the non-allowable set, the result is a dysfunctionalprotein, or a protein having a deleterious functionality. Any position805 with “.” or “-” in the alignment indicating a gap is taken intoaccount, whereby a position with a particular frequency (e.g., >50%) ofdots is defined as a grey region 815 in the signature. Grey region 815includes the least significant on the allowed amino acid set as it hasgaps or dots predominantly (e.g., >50%). Protein chart 800B includespositions 815 in the human domain sequence that contain a gap, but thecorresponding signature 815 h at those positions are not grey regions,indicating that there are more than 50% of amino acids at that positionin the alignment. 
In addition, there are positions in the human domain sequence containing amino acids but the corresponding signature is a grey region, meaning that there are more than 50% of gaps in that position in the alignment but the human sequence has an amino acid. Variable amino acids 810 v in the protein chart 800B play an important role in determining the pathogenicity of variants and their mutational impact and clinical significance in terms of protein functionality. Accordingly, the protein signature module defines a deleterious mutation as one changing an allowed amino acid to a non-allowed amino acid in the signature of protein chart 800B. Accordingly, protein chart 800B enables identifying pathogenic mutations when the resulting amino acid falls in the non-allowed set, and mutations resulting in amino acids falling in the allowed set may be benign.
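
The allowed versus non-allowed test described above can be illustrated with a short sketch. The function name `classify_substitution` and the representation of the signature as a list of per-position strings are hypothetical choices; gap symbols in a stack are ignored when forming the allowed set.

```python
GAP_CHARS = ".-"

def classify_substitution(stacks, position, new_aa):
    """Report whether new_aa belongs to the allowed set at a 1-based position.

    stacks: one string per domain position listing the residues seen there.
    """
    if not 1 <= position <= len(stacks):
        raise IndexError("position is outside the domain signature")
    allowed = {c for c in stacks[position - 1].upper() if c not in GAP_CHARS}
    return "allowed" if new_aa.upper() in allowed else "non-allowed"
```

Under this sketch, a substitution reported as "non-allowed" would be a candidate deleterious mutation, while an "allowed" result suggests the change may be benign.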

FIG. 8C includes protein chart 800C, which lists allowable sets 841 andnon-allowable sets 842 of amino acids for each amino acid position 801Cin a selected protein when the user selects a positive/negative displayin tab 810 (cf. protein chart 800B). In some embodiments, protein chart800C may validate and verify a designation of deleterious mutations 851from third party databases (e.g., database 252, including dbSNP or otherdatabases). Mutations 851 from a subject exhibiting a disease can alsobe overlaid on any of the protein signatures and on protein chart 800C.The mutated amino acids may be highlighted in a colored (e.g., ‘purple’)box in the signature plot. The mutation details are displayed on hoverof the purple box along with PolyPhen and SIFT information. Moreover,protein chart 800C may provide a mutation information 852 when mousehovering over the amino acid at position 30 (W—Tryptophan) leads tothree deleterious mutations (R, L, and C) that fall in the non-allowedregion (red), confirming that the algorithm based on this concept isvalid. In some embodiments, information 852 for a selected mutation maybe provided by a third party database (e.g., database 252, including theCOSMIC database) along with the number/frequency of samples (subjects)having those mutations in their studies. Furthermore, if a deleteriousmutation assessed by current methods falls within the green regionitself, this may indicate that the designation of the mutation as“deleterious” may be erroneous. Accordingly, protein chart 800C mayprovide a basis for testing whether a given variant is deleterious ornot (benign). In some embodiments, color codes may be used in proteinchart 800C to contrast allowable sets 841 (‘green’) with non-allowedsets 842 (‘red,’ ‘pink,’ or ‘salmon’). In some embodiments, proteincharts 800 include a color-coding of the amino acids based on theirhydropathy index values from blue to red from hydrophilic tohydrophobic. 
Blue boxes 805 and amino acids 807 are as described inchart 800B.

FIG. 8D illustrates protein chart 800D including the frequency/density of different variants to visualize the number of samples for each variant in each protein domain position (e.g., as curated by the COSMIC database). A color code may be used for graphical aid. For example, positions in amino acid string 801A with a single variant are represented in red, and positions with more than one variant are depicted as follows: two variants in blue, three variants in green, four variants in yellow, and more than four in magenta. The mutation position, ID, and amino acid change, along with the number of samples, are displayed on mouse hover of the peaks depicted in the plot. Protein chart 800D illustrates mutation frequencies associated with domains 820-1 (CPSase L D2), 820-2 (Biotin carb C), 820-3 (CPSase L chain), and 820-4 (Biotin lipoyl) within the selected protein (hereinafter, collectively referred to as “protein domains 820”). The number of samples for each of the variants at a specific position in amino acid string 801A of a domain 820 may be retrieved from a third party database (e.g., database 252 including the COSMIC database). Accordingly, protein chart 800D provides a visual indication of the number of variants at a specific position in a given domain 820 based on the different color codes.
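
The example color code above amounts to a small lookup, sketched here with a hypothetical helper named `variant_color`; returning `None` for positions without variants is an assumption.

```python
def variant_color(n_variants):
    """Map the number of distinct variants at a position to the example color code."""
    if n_variants < 1:
        return None  # assumed: positions without variants are not colored
    return {1: "red", 2: "blue", 3: "green", 4: "yellow"}.get(n_variants, "magenta")
```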

The protein signature module enables the user to explore proteinsignatures via protein charts 800 by providing the ability to searchover one or more databases based on different criteria, such as genebased criteria, wherein the protein signature can be visualized byselecting the appropriate gene name along with its transcript ID. Orbased on the number of domains and families, wherein a dropdown menuincluding the number of domains and families includes values rangingfrom 1 to 304, or even more. On selecting a number from the dropdown,the genes with corresponding number of domains and families are listedand the very first gene is visualized as default. Protein charts 800 mayalso allow the user to search protein signatures based on an averagevalue of amino acid substitutions. Accordingly, protein charts 800 mayinclude a dropdown menu including, for example: 20-16, 15-11, 10-6, and5-1 amino acid substitutions. Protein charts 800 may also allow the userto search protein signatures based on a Pfam ID. Accordingly, a proteinsignature may be visualized by selecting an appropriate Pfam ID alongwith the gene name and transcript ID. Protein charts 800 may also allowthe user to search protein signatures based on the alignment type.Accordingly, the protein signature can be visualized by selecting theappropriate alignment type 830 (seed/full) along with the gene name andtranscript ID. Protein charts 800 may include an alignment type dropdownmenu with the following values: only seed, only full, and seed and full.Protein charts 800 may also allow the user to search protein signaturesbased on a clinical association. Accordingly, protein charts 800 enablethe user to select a disease category such as Germline cancer, Somaticcancer, ACMG panel, inherited disorder, Industrial panel, and DMG panelalong with the disease name. The protein signature can be displayed forthe selected gene based on the clinical association. 
Protein charts 800may also allow the user to search protein signatures based on exceptiongenes. Accordingly, protein charts 800 enable a user to visualize theprotein signature of genes falling under the following criteria: (i)Contains an in-frame stop codon: Displays genes having stop codons inthe reading frame; (ii) Contains a selenocysteine: Displays genes havingselenocysteine (unusual amino acid); and (iii) Contains no stop codon:Displays genes having no stop codon at the end of CDS.

FIGS. 9A-9F illustrate exemplary embodiments of un-translated portions 900A, 900B, 900C, 900D, 900E, and 900F of a genome (hereinafter, collectively referred to as “UTRs 900”), according to embodiments disclosed herein. UTR 900A includes a nucleotide string 901A having a partially-coding exon 912, fully coding exons 910-1, 910-2, 910-3, 910-4, and 910-5 (hereinafter, collectively referred to as “exons 910”), a partially-coding exon 913, and a poly-A site 917 (AATAAA or ATTAAA). UTRs 900 may be provided by a UTR view module, as disclosed herein (e.g., UTR view module 260-7).

Promoter elements such as TATA, GC, and CAAT aid in the initiation of transcription at the transcription start site (TSS). There also exist multiple protein binding sites within the upstream sequences, which can extend up to several thousand bases. Proteins encoded by tumor suppressor genes such as TP53 and transcription regulating genes such as OBSCN and TAF3 bind to specific sequence motifs within the promoter regions of many genes that they control. A promoter is the binding site for the basal transcriptional apparatus (RNA polymerase and its cofactors), which provides the minimum machinery necessary to allow transcription of the gene. The enhancer regions are found at a distance from the promoter, at the 5′ or 3′ sides of the gene or within introns. They are typically short stretches of DNA (approximately 200 bases), each made up of a cluster of even shorter sequences (e.g., 25 bases) that are the binding sites for a variety of transcription factors. These transcription factor complexes interact with the basal transcriptional machinery at the promoter to enhance (or sometimes diminish) the transcription rate of the gene. Such interactions are possible because of the flexible nature of DNA, which allows the enhancers to come close to the promoter by looping out the DNA in between.

UTRs 900 define a promoter motif as the combination of its shorter elements such as TATA, CAAT, and GC boxes. Some embodiments calculate scores for the promoter motif by combining the individual scores of the shorter elements with various weights. This score defines the strength of the promoter. The scores of motifs for other transcriptional regulating sequences, such as enhancers and silencers, are calculated similarly. The same method is applied for poly-A sequences. Mutations in these motifs are recognized by the variations in these scores.
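
One way to read “combining the individual scores of the shorter elements with various weights” is a weighted average, sketched below. The function name `promoter_score`, the use of a weighted average, and any numeric weights supplied by a caller are illustrative assumptions, as the weights and the exact combination rule are not specified here.

```python
def promoter_score(element_scores, weights):
    """Combine per-element scores (e.g., TATA, CAAT, GC boxes) into one motif score.

    Assumed combination rule: a weighted average over the elements present.
    """
    total_weight = sum(weights[name] for name in element_scores)
    if total_weight <= 0:
        raise ValueError("weights of the scored elements must sum to a positive value")
    weighted = sum(score * weights[name] for name, score in element_scores.items())
    return weighted / total_weight
```

Under this reading, a mutation would be flagged by comparing `promoter_score` for the reference and the mutated sequence, since a changed element score shifts the combined value.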

Cryptic versions may exist for all of these regulatory elements, such as promoters, poly-A sites, and enhancers and silencers of promoters and poly-A signals. Mutations within these cryptic sites cause aberrations that can incorrectly enhance or suppress the gene expression or translational mechanisms. UTRs 900 thus enable a comprehensive understanding of various elements including promoters, UTRs, poly-A sites, and their cryptic sites, and their interplay with splicing and gene expression.

FIG. 9A illustrates UTR 900A, according to some embodiments.

FIG. 9B illustrates UTR 900B including a promoter motif 925 as the combination of its shorter elements such as TATA, CAAT, and GC boxes. UTR 900B illustrates other enhancer elements 921 and 927, and silencer elements 923 and 929 that interact with promoter motif 925 to activate and engage as element 921 (e.g., RNA polymerase). In some embodiments, the UTR view module calculates the scores for promoter motif 925 by combining the individual scores of the shorter elements with various weights. This score defines the strength of promoter motif 925. Scores for other transcriptional regulating sequences, such as enhancer elements 921 and 927 and silencer elements 923 and 929, are calculated similarly. The same method is applied for poly-A sites 917. Mutations in promoter motifs 925 are recognized by the variations in these scores. UTRs 900 identify real and cryptic promoter and poly-A sites 917 and elements by adapting and modifying relevant algorithms (e.g., algorithm 250, including MaxEntScan, NNSplice, and Human Splicing Finder) by using appropriate PWMs, consensus sequences, and lengths of the different motifs and elements. The UTR view module also uses these modified algorithms to detect mutations throughout the genes in the genome, with application to subject and cohort genomics.

Poly-A sites 917 present at the end of the coding sequence aid in the transport of mRNA molecules from the nucleus to the cytoplasm, where the translation process is initiated. Some elements upstream and downstream of poly-A sites 917 act as enhancers of polyadenylation. For example, a polyadenylation signal (PAS), including a canonical sequence element AATAAA, may be placed 10-30 bases upstream of poly-A site 917. A T/GT-rich downstream sequence element (DSE) may be located up to 30 bases downstream of poly-A site 917, and T-rich upstream sequence elements (TSE) may be located upstream of poly-A site 917. G-rich auxiliary downstream elements (Aux-DSE) may be located downstream of the DSE, and TGTA motifs may be found around a poly-A site 917. The secondary structure information of the amino acids 807 may act as enhancers of polyadenylation. Mutations in poly-A sites 917 and the above enhancer elements suppress the polyadenylation and affect the translation process by inhibiting mRNA transport and other translational regulation.

FIG. 9C illustrates UTR view 900C including a start codon 912C (‘ATG’) for an mRNA, with the Kozak consensus score (Y-axis) for sequences 922-5 upstream (5′ end to the left) and 922-3 downstream (3′ end to the right) of start codon 912C. The Kozak score for each motif is illustrated using their modified versions based on their consensus sequences and lengths. For each position 901C in UTR view 900C, the Kozak score is indicated for each possible nucleotide at that position (‘A,’ ‘C,’ ‘G,’ or ‘T’).
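
The per-position, per-nucleotide view described above can be illustrated with a deliberately simplified scoring. The sketch below is not the modified scoring of the disclosure: it assumes the commonly cited Kozak consensus GCCRCCATGG (R = A or G) and scores a 10-base window as the percentage of positions matching that consensus. The names `kozak_score` and `substitution_scores` are hypothetical.

```python
KOZAK_CONSENSUS = "GCCRCCATGG"  # assumed consensus; the start codon occupies indices 6-8
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}

def kozak_score(window):
    """Percentage of positions in a 10-base window that match the consensus."""
    window = window.upper()
    if len(window) != len(KOZAK_CONSENSUS):
        raise ValueError("window must be %d bases long" % len(KOZAK_CONSENSUS))
    matches = sum(1 for base, code in zip(window, KOZAK_CONSENSUS) if base in IUPAC[code])
    return 100.0 * matches / len(KOZAK_CONSENSUS)

def substitution_scores(window):
    """For each position, the score obtained with each of A, C, G, T placed there."""
    window = window.upper()
    return [
        {base: kozak_score(window[:i] + base + window[i + 1:]) for base in "ACGT"}
        for i in range(len(window))
    ]
```

The list returned by `substitution_scores` has one entry per position, each holding four scores, which corresponds to the per-nucleotide values plotted along the Y-axis.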

The user can search for genes based on various criteria, whereby thecorresponding UTR view plot and its features for the selected gene arecomputed and displayed for interactive analysis. In some embodiments,UTR view module provides multiple search criteria to analyze thefeatures of UTR in genes including a search by gene. Accordingly, theuser may search genes based on the gene symbols from the dropdown. Insome embodiments, UTR view module provides a search criterion by thenumber of ORFs. In some embodiments, the search criterion is based onthe number of u-ORFs, d-ORFs, and ORFs ranging from 1 to >300 that arepresent in the gene. In some embodiments, UTR view module provides asearch by promoter box: Based on the type and number of promotersequences such as TATA, GC, and CAAT that are present in the gene. Insome embodiments, UTR view module provides a search by promoter score:Based on the calculated scores for promoter sequences such as TATA box,GC box, CAAT box, initiator box score, and average promoter score (forthe complete promoter motif) that are present in the gene. In someembodiments, a UTR view module provides a search by poly-A signal: Basedon the occurrence of poly-A sequence such as AATAAA or ATTAAA, or AATAAAand ATTAAA present in the gene. In some embodiments, UTR view moduleprovides a search by exon classes: Based on the exon classificationssuch as 5′ exons, 3′ exons, intron-less, and internal exons present inthe genes. In some embodiments, UTR view module provides a search byclinical association: The disease association of somatic cancer,germline cancer, inherited disorders, industrial panels, ACMG panels,DMG panels, and other possible panel sources are enabled in the dropdownlist. In some embodiments, UTR view module provides a search byexception genes: Genes that exhibit a rare characteristic exon behaviorsuch as an in-frame stop codon, selenocysteine codon, or no stop codonspresent in the end of CDS.

FIG. 9D illustrates UTR view 900D including a graphic payload result from a query built by the user from various dropdown lists enabled by the UTR view module, as described above. The results may be fetched from the database (e.g., one or more third party databases) and presented in the form of a gene view. UTR view 900D facilitates the identification of pre-splicing and post-splicing events in transcription and translation of CDS 910, UTRs 919, u-ORF 918-1, real-ORF 918-2, d-ORF 918-3 (hereinafter, collectively referred to as “ORFs 918”), and poly-A sequences that are depicted in the gene and mRNA structure plot with respect to nucleotide string 901D. ORFs 918 are delimited by a start codon 912 a and one of three stop codons 912 d (‘TAG,’ ‘TGA,’ or ‘TAA’).

ORFs 918 are classified into different classes based on their position (upstream, ‘u’ or downstream, ‘d’) with respect to the true start codon 912 a and stop codon 912 d (4 u-ORFs, and 4 d-ORFs). Accordingly, a u-ORF 918-1 is defined as a sequence from an ATG that precedes the real start codon 912 a to an in-frame stop codon 912 d that precedes or follows the real start codon 912 a. A d-ORF 918-3 is defined as a sequence from an ATG that follows real start codon 912 a to an in-frame stop codon 912 d that precedes or follows real stop codon 912 d. ORFs 918, promoters, and poly-A signals 917 occurring in the gene transcript are represented as per the color-coded schema. Upon clicking a u-ORF 918-1 or a d-ORF 918-3, the corresponding sequence is highlighted in the mRNA sequence view in addition to the 5′ and 3′ UTR, promoter, coding exons, start codon 912 a and stop codon 912 d, poly-A sites 917, and d-ORF 918C, with color codes and popup window details.
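
The u-ORF and d-ORF definitions above can be sketched as a scan over every ATG other than the real start codon. The names `find_orf_end` and `classify_orfs`, the 0-based coordinates, and the choice to report every ATG regardless of its frame relative to the real start codon are assumptions of this illustration; the finer subdivision into four u-ORF and four d-ORF classes is not modeled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orf_end(mrna, atg_index):
    """Index just past the first stop codon in frame with the ATG, or None."""
    for i in range(atg_index + 3, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i + 3
    return None

def classify_orfs(mrna, real_start):
    """List u-ORFs and d-ORFs relative to the real start codon at index real_start."""
    mrna = mrna.upper()
    if mrna[real_start:real_start + 3] != "ATG":
        raise ValueError("real_start must point at an ATG")
    orfs = []
    pos = mrna.find("ATG")
    while pos != -1:
        if pos != real_start:
            end = find_orf_end(mrna, pos)
            if end is not None:  # keep only ATGs followed by an in-frame stop codon
                kind = "u-ORF" if pos < real_start else "d-ORF"
                orfs.append({"kind": kind, "start": pos, "end": end})
        pos = mrna.find("ATG", pos + 1)
    return orfs
```

For the toy transcript `AATGAAATAGCATGCATGGTGACCTAA` with the real start codon at index 11, the sketch reports one u-ORF spanning indices 1 to 10 and one d-ORF spanning indices 15 to 27 (end-exclusive).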

FIG. 9E illustrates UTR view 900E, including a nucleotide sequence 901E of a UTR section of a nucleotide string. Mutations from a third party database source (e.g., ClinVar, dbSNP, and COSMIC) are overlaid on the promoters, on 5′ UTR and 3′ UTR elements 919 such as the Kozak sequence, u-ORFs, and d-ORFs (ORFs 918), and on poly-A sites 917-1 and 917-2, along with their clinical significance. Mutations from a subject genome and cohort genomes can also be visualized on UTR view 900E. A d-ORF 918C limited by start codon 912 a and stop codon 912 d is also indicated. In some embodiments, UTR view 900E may also include alternating exons 960.

Scores for Kozak sequences and the 4-base stop codons are also determined based on an algorithm (e.g., algorithm 250 including a Shapiro & Senapathy algorithm) and may be illustrated/tabulated together with UTR view 900E. Various tabs showing details of the mRNA sequence, splice sites, and promoters are provided, which enable the analysis of various UTR elements through interactive graphics and tables. The cis and trans-acting enhancers of genes, their binding proteins, and their interplay in complex gene regulation are also predicted using the identification of the target sequences of these motifs and elements, and their aberration in disease.

A modified S&S algorithm as disclosed herein predicts the promoter boxes (e.g., TATA box, CAAT box, GC box, initiator box) upstream of the gene. We found that it produces unique patterns for different score ranges (e.g., scores above 50, 60, 70, etc.). It also produces unique patterns of different promoter boxes upstream of a specific gene. We also observed that some of these patterns, such as the GC boxes, correspond with the G-quadruplex DNA structure. It is observed that a mutation in a G-quadruplex enhances the promoter strength and causes overexpression of the gene. For example, a C-KIT gene promoter mutation causes overexpression of the tyrosine kinase enzyme, leading to cancer. A drug called Gleevec has been successfully developed to inhibit the kinase to treat gastrointestinal stromal tumor (GIST). Thus, the unique repetitive GC box patterns produced by Genome Explorer will aid in the recognition of clinically significant mutations.

UTR view can also recognize mutations that weaken the promoter strength and cause under expression of the gene. This approach applies to enhancers and silencers of promoters, and to poly-A sites or signals and their cryptic versions, in a broad range of several thousand bases upstream and downstream of the gene, and within the gene.

In this C-KIT gene example, the field aims to inhibit the overexpressed tyrosine kinase activity for drug development. Using Splice Atlas, we can also aim to mask the predicted GC box mutation(s) through RNA interference technologies such as siRNA and RNA-i. By adjusting the dose of the interference RNA, we can control the over or under expression, thus leading to the cure of the cancer. By inhibiting the silencer activity, we can enhance the expression of a gene and vice versa (by using the enhancer). This unique approach of Splice Atlas will aid in the development of drugs for cancers and other diseases.

FIG. 9F illustrates UTR view 900F, which shows a 200-base sequence upstream of the C-KIT gene wherein the different promoter boxes are color coded. The repeated GC box pattern (blue ticks) occurs for modified S&S scores above 50.

FIGS. 10A-10B illustrate exemplary branch point views 1000A and 1000B (hereinafter, collectively referred to as “branch point views 1000”) of branch point sequences (BPS) 1050A, 1050B-1, and 1050B-2 in a genome (hereinafter, collectively referred to as “BPS 1050”), according to embodiments disclosed herein. Introns are non-coding sequences found within the pre-mRNA transcripts that are removed during the splicing process. Splicing of pre-mRNA is assisted by the spliceosome, which identifies specific sequence motifs for the recognition of splice sites within the introns. Introns contain a donor splice site 1012 at their 5′ end, and an acceptor splice site 1013 at their 3′ end. In some embodiments, BPS 1050 may be located anywhere from 15 to 40 nucleotides upstream from the 3′ end of an intron. BPS 1050 is a highly conserved splicing signal for spliceosome assembly and lariat formation. In some embodiments, BPS 1050 is a five base regulatory sequence that may contain an Adenine at its fourth base. Accordingly, the spliceosome first cleaves the pre-mRNA at donor splice site 1012 following the attachment of an snRNP (U1) to its complementary sequence within the intron. The free end binds with BPS 1050 downstream through pairing of a G nucleotide from the 5′ end of U1 and an Adenine from BPS 1050, forming a loop known as a ‘lariat,’ releasing the intron as an RNA lariat, and covalently combining the two exons upstream and downstream of the ‘looped’ intron.

In some embodiments, BPS 1050 may be identified by using an algorithm (e.g., algorithm 250, including a modified Shapiro & Senapathy algorithm and other relevant algorithms) parsing the nucleotide string 1001 in the intron sequences upstream of the 3′ end. In some embodiments, the algorithm is configured to identify a cryptic BPS 1050 within the gene. Accordingly, some embodiments provide a database for different BPS 1050s in the genome.

FIG. 10A illustrates branch point view 1000A, according to some embodiments.

FIG. 10B illustrates branch point view 1000B including a fully coding exon 1010 having a 5′ partially-coding end 1022-5 and a 3′ partially-coding end 1022-3. A non-coding exon 1014 may also be identified. Coding exon 1010 is delimited by a true acceptor 1002a and a true donor 1002d. Cryptic donor 1012d and cryptic acceptor 1012a are also identified. Branch point view 1000B illustrates a sliding window 1052 of variable size (e.g., 5 bases: ‘TTCAC’) that is applied on the stripped sequence from 14 to 35 bases upstream of the 3′ intron end. All possible occurrences of 5-mers (for instance) are identified and their scores are calculated (e.g., based on the PWM). Among all the 5-mers, the one with the highest score (and also above a selected threshold, e.g., 50) is considered as BPS 1050B-1 or BPS 1050B-2 (hereinafter, collectively referred to as “BPS 1050B”). Also, using the same method, BPS 1050 is identified throughout the intron sequences, exons, and the complete gene, and these occurrences are named cryptic branch points. When the scores of each of the 5-mers are lower than a selected threshold, the stripped sequence is again searched for the first occurrence of “A” from the 3′ end (e.g., from −14 to −35 bases). If an “A” is found, it is considered as the consensus A of BPS 1050B (4th base); three bases upstream (e.g., ‘AGC’) and one base downstream of that A (e.g., ‘G’) are then included in BPS 1050B. For example, “A” may occur at the −22 position, and thus the branch point sequence identified is “AGCAG.” There may be a few recognizable species of BPS 1050B around the non-canonical A base which can be identified and isolated based on a variety of signal identifying methodologies.
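By way of a non-limiting illustration, the sliding-window search and the fallback search for the first “A” described above may be sketched as follows. The weight matrix HYPOTHETICAL_BPS_PWM, the 0-100 rescaling in pwm_score, and the function and parameter names are assumptions made for this sketch only; they are not the PWM or the scoring of the modified Shapiro & Senapathy algorithm. The sketch assumes an upper-case intron sequence over A, C, G, and T.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical 5-position weight matrix (percent base frequencies).
# These numbers are illustrative assumptions, not values from this disclosure.
HYPOTHETICAL_BPS_PWM: List[Dict[str, float]] = [
    {"A": 10, "C": 40, "G": 10, "T": 40},
    {"A": 5, "C": 10, "G": 5, "T": 80},
    {"A": 40, "C": 20, "G": 30, "T": 10},
    {"A": 95, "C": 2, "G": 1, "T": 2},  # consensus A at the 4th base
    {"A": 10, "C": 45, "G": 5, "T": 40},
]


def pwm_score(kmer: str, pwm: List[Dict[str, float]]) -> float:
    """Rescale the summed weights of a k-mer to a 0-100 range (assumed scaling)."""
    total = sum(col[base] for col, base in zip(pwm, kmer))
    lowest = sum(min(col.values()) for col in pwm)
    highest = sum(max(col.values()) for col in pwm)
    return 100.0 * (total - lowest) / (highest - lowest)


def find_branch_point(
    intron: str,
    pwm: List[Dict[str, float]] = HYPOTHETICAL_BPS_PWM,
    threshold: float = 50.0,
    near: int = 14,
    far: int = 35,
) -> Optional[Tuple[str, int, Optional[float]]]:
    """Return (sequence, position of the A relative to the 3' end, score or None)."""
    k, n = len(pwm), len(intron)
    lo, hi = max(0, n - far), n - near + 1  # slice covering positions -far..-near
    if hi - lo < k:
        return None  # intron too short for the search region
    best = None
    for i in range(lo, hi - k + 1):
        kmer = intron[i:i + k]
        score = pwm_score(kmer, pwm)
        if best is None or score > best[2]:
            best = (kmer, i, score)
    if best is not None and best[2] > threshold:
        kmer, i, score = best
        return kmer, (i + 3) - n, score  # 4th base of the window is the branch A
    # Fallback: first "A" encountered when walking upstream from the 3' end.
    a_idx = intron.rfind("A", lo, hi)
    if a_idx >= 3 and a_idx + 2 <= n:
        return intron[a_idx - 3:a_idx + 2], a_idx - n, None
    return None
```

In this sketch, position −1 denotes the last base of the intron, so an “A” found 22 bases from the 3′ end is reported at position −22, consistent with the “AGCAG” example above.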

Branch point view 1000B also illustrates cryptic branch points 1055. The scores and the branch point sequence for each of the identified real and cryptic sites are shown on mouse hover. The mutations 1057 from database sources such as dbSNP, ClinVar, and COSMIC occurring on the branch sites, cryptic branch sites, splice sites, and cryptic splice sites are shown. On clicking any of the exons, introns, or mutations 1057, the corresponding position in the expanded view is automatically scrolled into focus. This enables the visualization and analysis of the various regulatory elements and their cryptic versions on the gene or transcript. Cryptic branch points 1055 may have an impact in disease associations when mutations occur within them. Thus, the BPS view module enables the visualization and deeper analysis of BPS 1050 and other regulatory elements and their cryptic versions, individually and in combinations, in a single application.

The BPS view platform may provide search capabilities for the user according to different search criteria such as a gene basis, to search genes by entering gene symbols. In some embodiments, the search criteria may include a number of cryptic branch points, to search genes that contain a high frequency of cryptic branch point sites. In some embodiments, the search criteria may include a cryptic branch point score, to search genes that contain the highest (or one of the higher) cryptic branch point scores. In some embodiments, the search criteria may include a clinical association, to search genes based on various disease panels, a drug metabolizing gene (DMG) panel, and the American College of Medical Genetics and Genomics (ACMG) gene panel. In some embodiments, the search criteria may include an exception gene, to visualize a BPS in genes which fall under the following criteria: (i) Contains an in-frame stop codon: Displays genes having stop codons in the reading frame; (ii) Contains a selenocysteine: Displays genes having selenocysteine (an unusual amino acid); and (iii) Contains no stop codon: Displays genes having no stop codon at the end of CDS.

In some embodiments, a platform as disclosed herein enables a search for enhancers and silencers for any gene based on several search criteria such as a number of enhancers/silencers, to search genes that contain a high frequency of enhancers/silencers above a score of a pre-selected value (e.g., 70 or the highest). Search criteria may include an enhancers/silencers score, to search genes that contain high enhancers/silencers scores (e.g., the highest). Search criteria may include a gene, to search genes by entering gene symbols. Search criteria may include a clinical association, to search genes based on various disease panels, a drug metabolizing gene (DMG) panel, and the American College of Medical Genetics and Genomics (ACMG) gene panel. Search criteria may include an exception gene, to visualize the protein signature of the genes which fall under the following criteria: (i) Contains an in-frame stop codon: Displays genes having stop codons in the reading frame; (ii) Contains a selenocysteine: Displays genes having selenocysteine (an unusual amino acid); and (iii) Contains no stop codon: Displays genes having no stop codon at the end of CDS.

FIGS. 11A-11B illustrate exemplary embodiments of non-coding RNA genes 1100A and 1100B (hereinafter, collectively referred to as “ncRNA genes 1100”), according to embodiments disclosed herein. The ncRNA genes 1100 may be provided by an ncRNA map module as disclosed herein (e.g., ncRNA map module 260-10).

The ncRNA genes from the genome are identified based on available annotations. Graphical representation of tRNA, rRNA, miRNA, snoRNA, snRNA, and lncRNA genes in the ncRNA map is achieved by incorporating a dedicated database. Sequence information for these ncRNA genes and their exons is retrieved from SpliceDB and the graphical representation of ncRNAs is implemented. Known mutations from data sources such as dbSNP, COSMIC, and ClinVar are depicted within the ncRNA genes in the corresponding positions. In addition, mutations from individual subjects and cohorts of subjects are also overlaid on the gene plot. The effects of mutations on the ncRNAs (such as defects in a tRNA leading to incorrect amino acid incorporation into proteins, or defects in an miRNA gene leading to suppression of a specific gene expression or translation) are also predicted using the indigenous algorithm of the ncRNA map module. Furthermore, identification of ncRNA genes overlapping with the protein-coding genes is performed by comparing the coordinates of ncRNA and protein-coding genes.
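As a minimal sketch of the coordinate comparison described above, overlapping ncRNA and protein-coding genes may be found with a closed-interval overlap test. The Gene record, its field names, and the assumption of 1-based inclusive coordinates are hypothetical choices made for illustration and are not taken from SpliceDB.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    """Hypothetical gene record with 1-based, inclusive coordinates."""
    name: str
    chrom: str
    start: int
    end: int


def overlapping_pairs(ncrna: List[Gene], coding: List[Gene]) -> List[Tuple[str, str]]:
    """Return (ncRNA name, protein-coding name) pairs whose coordinates overlap."""
    pairs = []
    for n in ncrna:
        for c in coding:
            # Two closed intervals overlap when each starts before the other ends.
            if n.chrom == c.chrom and n.start <= c.end and c.start <= n.end:
                pairs.append((n.name, c.name))
    return pairs
```

The nested loop is adequate for small gene lists; for whole-genome comparisons, sorting the genes by start coordinate or using an interval tree would be a more suitable design.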

TABLE IV

ncRNA type    Number of genes in the genome    Sequence length (spliced exons)
rRNA          19                               100-1,600
tRNA          447                              59-86
miRNA         1,500                            16-27
snoRNA        388                              33-350
snRNA         95                               63-332

There exists variability in these ncRNAs, for instance, in a specific tRNA across multiple organisms, which helps in predicting the pathogenicity of a variant from a subject. When a mutated base falls in the non-allowed region, the structure/function of the RNA molecule is greatly altered, whereas if it falls within the allowed set, the structure/function of the RNA is not altered or is only slightly altered. Signatures for each type of ncRNA gene are constructed by considering the non-redundant bases in each of the positions of the aligned ncRNA sequences from various organisms. The variable and invariable positions from the ncRNA signatures are also identified. The effects of mutations are computed based on the allowed/non-allowed bases from the signature of the specific ncRNA genes.
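The signature construction described above may be illustrated, under simplifying assumptions, as follows. The sketch assumes that the ncRNA sequences have already been aligned to equal length with “-” marking gaps; the function names are hypothetical, and the allowed/non-allowed call is a plain set-membership test rather than the prediction algorithm of the ncRNA map module.

```python
from typing import List, Set


def build_signature(aligned: List[str]) -> List[Set[str]]:
    """Collect the non-redundant bases seen at each aligned position (gaps ignored)."""
    return [set(column) - {"-"} for column in zip(*aligned)]


def invariable_positions(signature: List[Set[str]]) -> List[int]:
    """Positions (0-based) at which exactly one base is observed."""
    return [i for i, bases in enumerate(signature) if len(bases) == 1]


def classify_mutation(signature: List[Set[str]], position: int, base: str) -> str:
    """Label a mutated base as allowed or non-allowed at the given position."""
    return "allowed" if base in signature[position] else "non-allowed"
```

For example, for the aligned sequences "GCAU", "GCGU", and "GC-U", the third position is variable (A or G), and a mutation to G at that position would be labeled as allowed.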

The ncRNA map module may include a search engine that enables the user to search for portions of a nucleotide string in a subject genome according to a menu of criteria. In some embodiments, the criteria may include an ncRNA gene, to visualize splicing events for individual transcripts for the selected gene. The criteria may also include the type of ncRNA, to search and visualize specific types of non-coding RNA genes. The criteria may include a clinical association, to search and visualize splicing events for individual transcripts in genes implicated in ncRNA gene panels for all major cancers and inherited disorders. The criteria may include overlapped genes, wherein coordinates of the RNA genes are checked to identify whether they overlap with any of the genes (protein-coding) present and the overlapping genes are illustrated. The criteria may include a number of cryptic sites, to identify genes having a high frequency of cryptic splice sites that can be searched based on the number of cryptic sites. Cryptic splice sites can be visualized for individual transcripts for the selected gene. The criteria may include a cryptic site score, to identify genes having high cryptic splice site scores that can be searched based on the scores (with options for >70, >80, and >90 to choose from). The cryptic splice sites can be visualized for individual transcripts for the selected gene.

The ncRNA genes 1100 are plotted on nucleotide string 1101 on the scale of the gene length, depicting exons and introns within them. Those ncRNA genes that overlap with protein-coding genes are also highlighted. The sequences of these ncRNA genes are also provided for further analysis. The mutations from the publicly available databases, and the genomes of patients and cohorts, are marked on the ncRNA gene view and the sequence view as well. The effects of mutations on the ncRNA genes are predicted and visualized for deeper analysis. Cryptic splice sites, promoters, enhancers, and silencers may exist for these ncRNA genes; these are also identified using the modified S&S and other relevant algorithms and visualized on the gene view and sequence view. The mutations on these cryptic regulatory sites from known data sources, individual patients, and cohorts of patients are visualized on the gene and sequence view.

FIG. 11A illustrates a specific sequence 1121 of ncRNA gene 1100A that may be provided for further analysis. A mutation 1150A is indicated within sequence 1125. In some embodiments, mutation 1150A is identified from a third party database, and the genomes of subjects and cohorts. The effect of mutation 1150A on ncRNA gene 1100A may be predicted and visualized for deeper analysis by the ncRNA map module, and provided in the graphic payload upon a mouse over by the user.

FIG. 11B illustrates ncRNA map 1100B, including pop up window 1150B. Cryptic splice sites, promoters, enhancers, and silencers may exist for these ncRNA genes; these are also identified using an algorithm (e.g., algorithm 250 including a modified Shapiro & Senapathy algorithm and other relevant algorithms) and visualized on the gene view and sequence view. The nucleotide variability within the ncRNA genes may be displayed as stacks to form signatures. Mutations within the ncRNA gene can be visualized in these signatures, and their pathogenicity and disease associations can be analyzed.

A database coupled to the ncRNA map module (e.g., database 252) includes desirable details such as sequence annotation, splice sites, cryptic splice sites, promoter, branch points, poly-A, and known mutation information. In some embodiments, the database may include information for regulatory and splicing elements such as promoters, UTRs, splice donor, acceptor and branch points, poly-A sites, enhancers and silencers of gene regulation, and splicing from different data sources (e.g., NCBI, PFAM, PfamScan, Ensembl, PDB, UniProt, ClinVar, dbSNP, COSMIC, Variant Effect Predictor, PolyPhen, SIFT), together with added scores for each of these elements based on modified Shapiro & Senapathy and other relevant algorithms, with this information accumulated for genes in the human genome into a unified database. In addition, the database may include the positions and sequences of the cryptic versions of each of the regulatory and splicing elements throughout each of the genes, integrated into this database. Furthermore, the database includes accumulated information from intergenic regions from the whole human genome. In some embodiments, the database is designed to search for subject mutations and overlay them on the gene structure and sequence.

In addition, various types of ncRNA genes are predicted within the dark matter genome using several prediction algorithms such as tRNAscan-SE, tRNA-DL, miRDB, miRIAD, LncFinder, and PLAIDOH. We will use multiple tools for each type of ncRNA to ensure that genuine ncRNA genes are not missed. We will also use our proprietary algorithms to identify these ncRNAs based on the variable sequence matrix specific to each type of ncRNA, which is split into shorter variable sequence signatures.

FIG. 12 illustrates a process 1200 for finding a variable and a non-variable sequence signature of a protein, according to some embodiments. The variable amino acid sequence signature of a domain from many different organisms is based on their multiple sequence alignment (MSA). We now have come up with a method to obtain the variable sequence signature of the domain using protein sequences from the same organism. This approach has several advantages: 1) it avoids unknown gaps that arise from multiple organisms; 2) there are many unique orphan proteins and domains present in different organisms. These orphan domains are missed in the MSA from multiple organisms. However, when we align the same protein sequence from numerous individuals of the same organism, it will lead to discovery of new domains that cannot be discovered from the MSA of multiple organisms.

The new domains will be demarcated by a variability that is characteristic of the genuine domains, in which highly variable, invariable, and low variable AAs will be present in a recognizable manner. Mutations can also be detected and correlated with disease and drug response phenotypes using all the genetic elements of the gene.

Currently, the PWM of splicing and regulatory elements is constructed based on each type of element from a given organism. We now have come up with a method to obtain the variable sequence signature of a specific type of element, for example a donor, in a specific exon in a specific gene (e.g., TP53, exon 3, donor) by MSA of the same donor from numerous individuals.

Process 1200 may enable the discovery of unidentified elements. For example, promoter sequences are yet to be identified clearly in many genes, especially within multiple binding sites for regulatory proteins. However, when we align the genome sequence from numerous individuals of the same organism, it will lead to the discovery of new promoters, poly-A sites and signals, enhancers, silencers, and binding sites (and other elements) for binding regulatory proteins from the multiple sequence alignment. The MSA of the genome sequences of numerous individuals shows fewer variable positions in the binding sites compared to other positions, which helps in the identification of new binding sites that cannot be discovered with the generic approach. It also helps to identify the mutations in these elements from an individual more easily, as such a mutation will show up as a rare variation (e.g., 0.0001%), that is, as an outlier that can be easily recognized.
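A simplified sketch of how a rare variation may be flagged in an alignment of the same element from numerous individuals is given below. The function name, the max_freq cutoff, and its default value are assumptions chosen for illustration; the sketch assumes pre-aligned, equal-length sequences and does not reproduce any particular alignment tool.

```python
from collections import Counter
from typing import List, Tuple


def rare_variants(aligned: List[str], max_freq: float = 0.01) -> List[Tuple[int, str, float]]:
    """Return (position, base, frequency) for bases at or below max_freq in a column."""
    n = len(aligned)
    found = []
    for pos, column in enumerate(zip(*aligned)):
        counts = Counter(column)
        for base, count in counts.items():
            freq = count / n
            if freq <= max_freq:
                found.append((pos, base, freq))
    return found
```

For instance, if 200 individuals carry "ACGT" and one individual carries "ACTT", the T at the third position has a frequency of 1/201 and is reported as an outlier, while the remaining positions are left unflagged.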

USPACE=20×20×20=8,000 AA sequences

VSPACE=2×4×3=24 sequences

NVSPACE=USPACE−VSPACE=8,000−24=7,976 AA sequences

Allowed AAs at each position (2, 4, and 3 AAs, respectively) → VSPACE:

Position 1    Position 2    Position 3
              Trp
              Glu           Asp
Ser           Ala           Arg
Phe           Gly           Tyr

VSIG=AA group 1 (Phe, Ser) - AA group 2 (Gly, Ala, Glu, Trp) - AA group 3 (Tyr, Arg, Asp)

The variable and non-variable sequence signature of the λ repressor protein. (A) The allowed AAs (green) and non-allowed AAs (red) at each position of a 17-AA sequence portion of λ repressor (as experimentally determined) represent the VSIG and NVSIG of the protein. Even one AA change at a single position that diverges from the allowed AAs will make the protein defective. (B) The VSPACE, NVSPACE, and USPACE of a protein (the example shows a sequence of three AAs). The USPACE is the set of many possible sequences created by the combination of many of the twenty AAs at each sequence position. The VSPACE is defined as the set of AA sequences formed by every combination of the allowed AAs at each position. The NVSPACE is the USPACE minus the VSPACE.
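The three-AA example above can be checked with a short enumeration. The variable names are chosen for this sketch only; the allowed groups are those listed in the VSIG above, and the USPACE is counted here using all twenty AAs at each of the three positions, matching the 20×20×20 figure.

```python
from itertools import product

# Allowed amino acids at each of the three positions (from the VSIG above).
ALLOWED = [
    ("Phe", "Ser"),
    ("Gly", "Ala", "Glu", "Trp"),
    ("Tyr", "Arg", "Asp"),
]
N_AMINO_ACIDS = 20

# VSPACE: every combination of allowed AAs, one per position.
vspace = list(product(*ALLOWED))
uspace_size = N_AMINO_ACIDS ** len(ALLOWED)  # 20 x 20 x 20
vspace_size = len(vspace)                    # 2 x 4 x 3
nvspace_size = uspace_size - vspace_size     # USPACE minus VSPACE


def in_vspace(sequence) -> bool:
    """True when every residue is an allowed AA at its position."""
    return len(sequence) == len(ALLOWED) and all(
        aa in group for aa, group in zip(sequence, ALLOWED)
    )
```

A three-AA sequence such as Phe-Trp-Asp belongs to the VSPACE, whereas Trp-Trp-Asp does not, because Trp is not an allowed AA at the first position.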

The figure shows how the Amino Acid Sequence Variability (AAV) is constructed experimentally. We have described in this disclosure an algorithm for the construction of the variable amino acid sequence signature of a domain from many different organisms based on their pattern of multiple sequence alignment (MSA). That approach has the difficulties of introducing sequence gaps and possibly erroneous amino acids at some positions. We describe here a method to obtain the variable sequence signature of a domain using the protein sequences from different individuals of the same organism. This approach has several advantages: 1) it avoids unknown gaps that arise from multiple organisms, 2) it avoids sequence errors, and 3) it is expected to predict many unique orphan proteins and domains that are present in an organism that are not present in the other 108 distinct organisms. These orphan domains are missed in the multiple sequence alignment from multiple organisms. However, when we align the same protein sequence from numerous individuals of the same organism, it will lead to discovery of new domains and proteins that cannot be discovered from the MSA of multiple organisms, and to defining the AAV of these new domains in the process.

The new domains will be defined by a variability that is characteristic of the genuine domains, in which highly variable, invariable, and low variable AAs will be present in a recognizable manner. Mutations can also be detected and correlated with disease and drug response phenotypes using many of the genetic elements of the gene, and the +ve and −ve AAV signatures that have been defined above.

This approach identifies new proteins and domains in groups of distinct organisms each consisting of similar species, such as mammals, crustaceans, or mollusks, or different groups of plants.

FIG. 13 is a flowchart illustrating steps in a method 1300 for identifying and displaying a cryptic site in a nucleotide string, according to some embodiments. One or more of the steps in method 1300 may be performed at least partially by a processor executing instructions stored in a memory of a client device or a server communicatively coupled with each other via communications modules accessing a network, as disclosed herein (e.g., processors 212, memories 220, communications modules 218, client device 110, and server 130). In some embodiments, one or more of the steps in method 1300 may be performed by an application hosted by the server and installed in the client device, the application including a graphic display for illustrating the results of one or more of the steps in method 1300 (e.g., application 222 and graphic display 225). In some embodiments, method 1300 may be at least partially performed by a genome sequence analysis engine in the server, the genome sequence analysis engine including a sequence scoring tool, a mutation tool, a statistics tool, and an algorithm tool (e.g., genome sequence analysis engine 242, sequence scoring tool 244, mutation tool 246, statistics tool 248, and algorithm 250). Further, in some embodiments, one or more of the steps in method 1300 may be performed by an exon splice module, a cryptic splice module, an exon chart module, an alternative splice module, an exon frame module, a protein signature module, a UTR view module, a BPS view module, a regulatory module, an ncRNA map module, and a dark matter module interacting with the genome sequence analysis engine, consistent with the present disclosure (e.g., modules 260). In some embodiments, a method consistent with the present disclosure may include at least one of the steps in method 1300 performed in any order, simultaneously with one another, quasi-simultaneously, or overlapping in time.

Step 1302 includes identifying, in a nucleotide string, at least two exons, at least one acceptor, at least one donor, and at least one intron between the at least two exons. In some embodiments, step 1302 includes identifying, in the nucleotide string, a first exon that lacks the acceptor and contains the donor, and identifying, in the first exon, an open reading frame between a start codon for a gene and the donor. In some embodiments, step 1302 includes identifying, in the nucleotide string, a last exon that contains the acceptor and lacks the donor, and identifying an open reading frame between the acceptor and a terminator codon for a gene. In some embodiments, step 1302 includes identifying, in the nucleotide string, a branch point within the intron, the branch point being associated with a splicing site of the nucleotide string to combine the two exons. In some embodiments, step 1302 includes identifying, in a nucleotide string, a mutation, wherein the mutation includes a modification in at least one of the two exons, the intron, the acceptor or the donor, and optionally a branch point, and graphically marking, in the display for the user, the mutation in the nucleotide string. In some embodiments, step 1302 includes identifying, within an exon or the intron, a splice enhancer including a binding site for a spliceosome enhancer factor that promotes a splicing of exons of a gene, wherein the gene includes at least a portion of the exon and the intron. In some embodiments, step 1302 includes identifying, within an exon or the intron, a splice silencer site including a binding site for an inhibitor factor that suppresses a splicing of exons of a gene, wherein the gene includes at least a portion of the exon and the intron. In some embodiments, step 1302 includes determining a deleteriousness score of a mutation of the true splice site or the cryptic splice site based on the similarity score. In some embodiments, step 1302 includes determining the similarity score by executing instructions from an algorithm selected from a group consisting of a Shapiro & Senapathy algorithm, a MaxEntScan algorithm, and an NNSplice algorithm, stored in a memory. In some embodiments, step 1302 includes identifying, in the nucleotide string, a cryptic exon that includes at least one cryptic acceptor and one cryptic donor, and optionally, an open reading frame, between the cryptic acceptor and the cryptic donor, when a cryptic splice site score is higher than a pre-selected threshold, and a length of the cryptic exon conforms to a pre-selected threshold. In some embodiments, step 1302 includes optionally identifying a cryptic branch point upstream of the cryptic exon.

Step 1304 includes identifying, in the nucleotide string, a cryptic site including a sequence of nucleotides based on a similarity score with at least one of the acceptor and the donor.

Step 1306 includes graphically marking, in a display for a user, the nucleotide string at a location indicative of an exon, an intron, a true splice site, and optionally a cryptic splice site when the similarity score is higher than a pre-selected threshold.
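By way of a non-limiting illustration, steps 1304 and 1306 may be sketched as a scan that compares every window of a nucleotide string against a known true splice site and retains the windows whose similarity score exceeds a threshold. In this sketch, percent identity is used as a stand-in similarity score; it is an assumption made for brevity and is not the Shapiro & Senapathy, MaxEntScan, or NNSplice scoring. The function names and the default threshold of 70 are likewise hypothetical.

```python
from typing import List, Tuple


def percent_identity(candidate: str, reference: str) -> float:
    """Stand-in similarity score: percentage of matching bases (equal lengths assumed)."""
    matches = sum(a == b for a, b in zip(candidate, reference))
    return 100.0 * matches / len(reference)


def find_cryptic_sites(
    sequence: str, true_site_start: int, site_length: int, threshold: float = 70.0
) -> List[Tuple[int, str, float]]:
    """Return (start, window, score) for windows resembling the true splice site."""
    reference = sequence[true_site_start:true_site_start + site_length]
    hits = []
    for start in range(len(sequence) - site_length + 1):
        if start == true_site_start:
            continue  # the true site itself is not a cryptic site
        window = sequence[start:start + site_length]
        score = percent_identity(window, reference)
        if score > threshold:
            hits.append((start, window, score))
    return hits
```

The returned start positions are the locations that a display layer could mark on the nucleotide string alongside the true splice site.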

FIG. 14 is a flowchart illustrating steps in a method 1400 for creating and displaying a protein signature in an amino acid string, according to some embodiments. One or more of the steps in method 1400 may be performed at least partially by a processor executing instructions stored in a memory of a client device or a server communicatively coupled with each other via communications modules accessing a network, as disclosed herein (e.g., processors 212, memories 220, communications modules 218, client device 110, and server 130). In some embodiments, one or more of the steps in method 1400 may be performed by an application hosted by the server and installed in the client device, the application including a graphic display for illustrating the results of one or more of the steps in method 1400 (e.g., application 222 and graphic display 225). In some embodiments, method 1400 may be at least partially performed by a genome sequence analysis engine in the server, the genome sequence analysis engine including a sequence scoring tool, a mutation tool, a statistics tool, and an algorithm tool (e.g., genome sequence analysis engine 242, sequence scoring tool 244, mutation tool 246, statistics tool 248, and algorithm 250). Further, in some embodiments, one or more of the steps in method 1400 may be performed by an exon splice module, a cryptic splice module, an exon chart module, an alternative splice module, an exon frame module, a protein signature module, a UTR view module, a BPS view module, a regulatory module, an ncRNA map module, and a dark matter module interacting with the genome sequence analysis engine, consistent with the present disclosure (e.g., modules 260). In some embodiments, a method consistent with the present disclosure may include at least one of the steps in method 1400 performed in any order, simultaneously with one another, quasi-simultaneously, or overlapping in time.

Step 1402 includes identifying a first amino acid string corresponding to a functional protein or protein domain. In some embodiments, step 1402 includes identifying an amino acid that is different from an allowable amino acid as a disallowed amino acid at the aligned location. In some embodiments, step 1402 includes identifying, in a nucleotide string, a positive signature when the nucleotide string codes an allowed amino acid in the functional protein, and a negative signature when the nucleotide string codes a non-allowed amino acid in the functional protein. In some embodiments, step 1402 includes graphically marking a mutation of the nucleotide string on the positive signature and the negative signature. In some embodiments, step 1402 includes optionally determining a deleterious effect of the mutation based on whether the mutation occurs within the positive signature or the negative signature. In some embodiments, step 1402 includes identifying, in a nucleotide string coding a protein domain in the functional protein, a mutation leading to a disallowed amino acid, and determining a mutated hydropathy signature of the protein domain based on a hydropathy of a mutated amino acid. In some embodiments, step 1402 includes determining a normal hydropathy signature of the protein domain based on a hydropathy of an allowed amino acid or a disallowed amino acid and determining a deleteriousness score for the mutation based on a difference between the mutated hydropathy signature of the protein domain and the normal hydropathy signature of the protein domain. In some embodiments, step 1402 includes determining a deleteriousness score for the mutation based on whether a mutation occurs within a positive signature indicating no deleteriousness or a negative signature indicating a deleteriousness.

Step 1404 includes aligning the first amino acid string with at least one additional amino acid string that encodes a functional variant of the functional protein.

Step 1406 includes identifying, at each amino acid position within the additional amino acid string, multiple variable amino acids that appear in the at least one additional amino acid string for each aligned location in the first amino acid string.

Step 1408 includes graphically marking, in a display for a user, a variable amino acid as an allowable amino acid at an aligned location in the first amino acid string. In some embodiments, step 1408 includes stacking a non-redundant amino acid at each position of the additional amino acid string in the functional protein. In some embodiments, step 1408 includes graphically distinguishing, in the display for the user, the allowed amino acid and a disallowed amino acid at each aligned location. In some embodiments, step 1408 includes graphically indicating a hydropathy of each variable amino acid at each aligned location.
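A minimal sketch of steps 1402 through 1408 is given below, under stated assumptions. The Kyte-Doolittle hydropathy index is assumed as the hydropathy scale, since no particular scale is specified here. The normal hydropathy at a position is taken as the mean over the allowed amino acids, and the hydropathy shift is the absolute difference between the mutated amino acid and that mean; both choices, and all function names, are illustrative assumptions rather than the scoring used by the protein signature module. The sketch assumes pre-aligned strings of one-letter amino acid codes with no all-gap columns.

```python
from typing import Dict, List, Set

# Kyte-Doolittle hydropathy index (assumed choice of hydropathy scale).
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def allowed_amino_acids(aligned: List[str]) -> List[Set[str]]:
    """Stack the non-redundant amino acids at each aligned position (gaps ignored)."""
    return [set(column) - {"-"} for column in zip(*aligned)]


def normal_hydropathy(allowed: List[Set[str]]) -> List[float]:
    """Mean hydropathy of the allowed amino acids at each position."""
    return [sum(KYTE_DOOLITTLE[aa] for aa in aas) / len(aas) for aas in allowed]


def mutation_report(aligned: List[str], position: int, mutated_aa: str) -> Dict[str, object]:
    """Report whether a mutation is allowed and how far it shifts the hydropathy."""
    allowed = allowed_amino_acids(aligned)
    normal = normal_hydropathy(allowed)
    return {
        "allowed": mutated_aa in allowed[position],
        "hydropathy_shift": abs(KYTE_DOOLITTLE[mutated_aa] - normal[position]),
    }
```

For the aligned strings "MKV", "MRV", and "MKI", a mutation to L at the second position is disallowed and shifts the hydropathy by about 8.0 units on this scale, whereas a mutation to R is allowed and shifts it by about 0.3 units.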

FIG. 15 is a flowchart illustrating steps in a method 1500 for identifying and displaying a cryptic promoter site in a nucleotide string, according to some embodiments. One or more of the steps in method 1500 may be performed at least partially by a processor executing instructions stored in a memory of a client device or a server communicatively coupled with each other via communications modules accessing a network, as disclosed herein (e.g., processors 212, memories 220, communications modules 218, client device 110, and server 130). In some embodiments, one or more of the steps in method 1500 may be performed by an application hosted by the server and installed in the client device, the application including a graphic display for illustrating the results of one or more of the steps in method 1500 (e.g., application 222 and graphic display 225). In some embodiments, method 1500 may be at least partially performed by a genome sequence analysis engine in the server, the genome sequence analysis engine including a sequence scoring tool, a mutation tool, a statistics tool, and an algorithm tool (e.g., genome sequence analysis engine 242, sequence scoring tool 244, mutation tool 246, statistics tool 248, and algorithm 250). Further, in some embodiments, one or more of the steps in method 1500 may be performed by an exon splice module, a cryptic splice module, an exon chart module, an alternative splice module, an exon frame module, a protein signature module, a UTR view module, a BPS view module, a regulatory module, an ncRNA map module, and a dark matter module interacting with the genome sequence analysis engine, consistent with the present disclosure (e.g., modules 260). In some embodiments, a method consistent with the present disclosure may include at least one of the steps in method 1500 performed in any order, simultaneously with one another, quasi-simultaneously, or overlapping in time.

Step 1502 includes identifying, in a nucleotide string, at least two exons, at least one intron between the at least two exons, and a promoter sequence. In some embodiments, identifying the promoter sequence in step 1502 includes identifying at least one of a TATA box, a CAAT box, a GC box, and an initiator box. In some embodiments, identifying the promoter sequence in step 1502 includes identifying a TATA box, CAAT box, GC box, and initiator box, and, in addition, enhancers and silencers. In some embodiments, step 1502 includes determining the similarity score by executing instructions from an algorithm selected from a group consisting of a Shapiro & Senapathy algorithm, a MaxEntScan algorithm, and an NNSplice algorithm, stored in a memory.

Step 1504 includes selecting, within the nucleotide string, a cryptic promoter site including a sequence of nucleotides resembling the promoter sequence.

Step 1506 includes associating a score to the cryptic promoter site based on a similarity score between the cryptic promoter site and the promoter sequence. In some embodiments, the similarity score includes a combination of one or more of a TATA box, a CAAT box, a GC box, and an initiator box. In some embodiments, step 1506 includes determining the similarity score by executing instructions from an algorithm selected from a group consisting of a Shapiro & Senapathy algorithm, a MaxEntScan algorithm, and an NNSplice algorithm, stored in a memory.

Step 1508 includes graphically marking, in a display for a user, the nucleotide string at a location indicative of the cryptic promoter site when the score is higher than a pre-selected threshold.

FIG. 16 is a flowchart illustrating steps in a method 1600 for identifying and displaying a cryptic poly-A site in a nucleotide string, according to some embodiments. One or more of the steps in method 1600 may be performed at least partially by a processor executing instructions stored in a memory of a client device or a server communicatively coupled with each other via communications modules accessing a network, as disclosed herein (e.g., processors 212, memories 220, communications modules 218, client device 110, and server 130). In some embodiments, one or more of the steps in method 1600 may be performed by an application hosted by the server and installed in the client device, the application including a graphic display for illustrating the results of one or more of the steps in method 1600 (e.g., application 222 and graphic display 225). In some embodiments, method 1600 may be at least partially performed by a genome sequence analysis engine in the server, the genome sequence analysis engine including a sequence scoring tool, a mutation tool, a statistics tool, and an algorithm tool (e.g., genome sequence analysis engine 242, sequence scoring tool 244, mutation tool 246, statistics tool 248, and algorithm 250). Further, in some embodiments, one or more of the steps in method 1600 may be performed by an exon splice module, a cryptic splice module, an exon chart module, an alternative splice module, an exon frame module, a protein signature module, a UTR view module, a BPS view module, a regulatory module, an ncRNA map module, and a dark matter module interacting with the genome sequence analysis engine, consistent with the present disclosure (e.g., modules 260). In some embodiments, a method consistent with the present disclosure may include at least one of the steps in method 1600 performed in any order, simultaneously with one another, quasi-simultaneously, or overlapping in time.

Step 1602 includes identifying, in a nucleotide string, a poly-A addition site, wherein the poly-A addition site includes a poly-A site and a signal. In some embodiments, step 1602 includes identifying a signal that includes a nucleotide string that signals an appearance of the poly-A site near the signal. In some embodiments, step 1602 includes identifying, in the nucleotide string, an enhancer of a poly-A site. In some embodiments, step 1602 includes identifying, in the nucleotide string, a silencer of a poly-A site.

Step 1604 includes selecting, within the nucleotide string, a cryptic poly-A site, the cryptic poly-A site including a sequence of nucleotides resembling at least one of the poly-A sites.

Step 1606 includes associating a similarity score to the cryptic poly-A site based on a similarity between the cryptic poly-A site and a real poly-A site. In some embodiments, step 1606 includes determining the similarity score by executing instructions from an algorithm selected from a group consisting of a Shapiro-Senapathy algorithm, a MaxEntScan algorithm, and an NNSplice algorithm, stored in a memory.

Step 1608 includes graphically marking, in a display for a user, the nucleotide string at a location indicative of the cryptic poly-A site when the similarity score is higher than a pre-selected threshold. In some embodiments, step 1608 includes graphically marking, in the display for the user, a real poly-A site.
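
By way of a non-limiting illustration, steps 1604 through 1608 may be sketched in Python as follows. The sketch rests on assumptions that are not part of the disclosure: the hexamer AATAAA is taken as the reference for a real poly-A signal, the fraction of matching positions stands in for the scoring algorithms named above (it is none of them), and the function names, the threshold of 0.8, and the caret-based text marking are hypothetical.

```python
# Assumption: the canonical hexamer AATAAA stands in for a "real" poly-A signal.
REFERENCE_SIGNAL = "AATAAA"


def similarity(candidate: str, reference: str) -> float:
    """Return the fraction of positions at which candidate equals reference."""
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must have the same length")
    matches = sum(a == b for a, b in zip(candidate, reference))
    return matches / len(reference)


def find_cryptic_sites(sequence, reference=REFERENCE_SIGNAL, threshold=0.8):
    """List (position, window, score) for near matches scoring above threshold.

    Exact matches are treated as real sites and are skipped here.
    """
    sequence = sequence.upper()
    k = len(reference)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        score = similarity(window, reference)
        if window != reference and score > threshold:
            hits.append((start, window, score))
    return hits


def mark(sequence, hits, width=len(REFERENCE_SIGNAL)):
    """Return the sequence with a caret track under each cryptic site."""
    track = [" "] * len(sequence)
    for start, _window, _score in hits:
        for i in range(start, start + width):
            track[i] = "^"
    return sequence + "\n" + "".join(track)
```

For the string GGAATAAAGGATTAAAGG, this sketch reports a single candidate, ATTAAA at offset 10 with a score of 5/6, and skips the exact AATAAA at offset 2 as a real site. A production scoring tool would likely replace similarity() with one of the algorithms recited in step 1606.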

Hardware Overview

FIG. 17 is a block diagram illustrating an example computer system with which the client and server of FIGS. 1 and 2 and the methods of FIGS. 13-16 can be implemented. In certain aspects, the computer system 1700 may be implemented using hardware or a combination of software and hardware, either in a dedicated server, or integrated into another entity, or distributed across multiple entities.

Computer system 1700 (e.g., client device 110 and server 130) includes a bus 1708 or other communication mechanism for communicating information, and a processor 1702 (e.g., processors 212) coupled with bus 1708 for processing information. By way of example, the computer system 1700 may be implemented with one or more processors 1702. Processor 1702 may be a general-purpose microprocessor, a microcontroller, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a state machine, gated logic, discrete hardware components, or any other suitable entity that can perform calculations or other manipulations of information.

Computer system 1700 can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them stored in an included memory 1704 (e.g., memories 220), such as a Random Access Memory (RAM), a flash memory, a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable PROM (EPROM), registers, a hard disk, a removable disk, a CD-ROM, a DVD, or any other suitable storage device, coupled with bus 1708 for storing information and instructions to be executed by processor 1702. The processor 1702 and the memory 1704 can be supplemented by, or incorporated in, special purpose logic circuitry.

The instructions may be stored in the memory 1704 and implemented in one or more computer program products, e.g., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, the computer system 1700, and according to any method well known to those of skill in the art, including, but not limited to, computer languages such as data-oriented languages (e.g., SQL, dBase), system languages (e.g., C, Objective-C, C++, Assembly), architectural languages (e.g., Java, .NET), and application languages (e.g., PHP, Ruby, Perl, Python). Instructions may also be implemented in computer languages such as array languages, aspect-oriented languages, assembly languages, authoring languages, command line interface languages, compiled languages, concurrent languages, curly-bracket languages, dataflow languages, data-structured languages, declarative languages, esoteric languages, extension languages, fourth-generation languages, functional languages, interactive mode languages, interpreted languages, iterative languages, list-based languages, little languages, logic-based languages, machine languages, macro languages, metaprogramming languages, multiparadigm languages, numerical analysis languages, non-English-based languages, object-oriented class-based languages, object-oriented prototype-based languages, off-side rule languages, procedural languages, reflective languages, rule-based languages, scripting languages, stack-based languages, synchronous languages, syntax handling languages, visual languages, Wirth languages, and XML-based languages. Memory 1704 may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1702.

A computer program as discussed herein does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and inter-coupled by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.

Computer system 1700 further includes a data storage device 1706 such as a magnetic disk or optical disk, coupled with bus 1708 for storing information and instructions. Computer system 1700 may be coupled via input/output module 1710 to various devices. Input/output module 1710 can be any input/output module. Exemplary input/output modules 1710 include data ports such as USB ports. The input/output module 1710 is configured to connect to a communications module 1712. Exemplary communications modules 1712 (e.g., communications modules 218) include networking interface cards, such as Ethernet cards and modems. In certain aspects, input/output module 1710 is configured to connect to a plurality of devices, such as an input device 1714 (e.g., input device 214) and/or an output device 1716 (e.g., output device 216). Exemplary input devices 1714 include a keyboard and a pointing device, e.g., a mouse or a trackball, by which a user can provide input to the computer system 1700. Other kinds of input devices 1714 can be used to provide for interaction with a user as well, such as a tactile input device, visual input device, audio input device, or brain-computer interface device. For example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, tactile, or brain wave input. Exemplary output devices 1716 include display devices, such as an LCD (liquid crystal display) monitor, for displaying information to the user.

According to one aspect of the present disclosure, the client device 110 and server 130 can be implemented using a computer system 1700 in response to processor 1702 executing one or more sequences of one or more instructions contained in memory 1704. Such instructions may be read into memory 1704 from another machine-readable medium, such as data storage device 1706. Execution of the sequences of instructions contained in memory 1704 causes processor 1702 to perform the process steps described herein. One or more processors in a multi-processing arrangement may also be employed to execute the sequences of instructions contained in memory 1704. In alternative aspects, hard-wired circuitry may be used in place of or in combination with software instructions to implement various aspects of the present disclosure. Thus, aspects of the present disclosure are not limited to any specific combination of hardware circuitry and software.

RECITATION OF EMBODIMENTS

The subject technology is illustrated, for example, according to various aspects described below. Various examples of aspects of the subject technology are described as numbered embodiments. These are provided as examples, and do not limit the subject technology.

Embodiment I: a computer-implemented method includes identifying, in a nucleotide string, at least two exons, at least one acceptor, at least one donor, and at least one intron between the at least two exons, identifying, in the nucleotide string, a cryptic splice site including a sequence of nucleotides based on a similarity score with at least one of the acceptor or the donor, and graphically marking, in a display for a user, the nucleotide string at a location indicative of an exon, an intron, a true splice site, and optionally a cryptic splice site when the similarity score is higher than a pre-selected threshold.

Embodiment II: a computer-implemented method includes identifying a first amino acid string corresponding to a functional protein or protein domain, aligning said first amino acid string with at least one additional amino acid string that encodes a functional variant of said functional protein, identifying, at each amino acid position within said additional amino acid string, multiple variable amino acids that appear in the at least one additional amino acid string for each aligned location in the first amino acid string, and graphically marking, in a display for a user, a variable amino acid as an allowable amino acid at an aligned location in said first amino acid string.
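
As a non-limiting sketch of Embodiment II, the collection of allowable amino acids at each aligned location may look like the following Python. The sketch assumes the strings are already aligned to equal length with "-" as the gap character, and it counts the residue of the first string as allowable at its own position; both choices, together with the function names, are assumptions made for illustration and are not recited in the disclosure.

```python
def allowable_residues(reference, variants, gap="-"):
    """Return, per aligned position, the set of residues seen in any string.

    Assumes reference and variants are pre-aligned to the same length.
    """
    for variant in variants:
        if len(variant) != len(reference):
            raise ValueError("all aligned strings must have the same length")
    columns = []
    for i, ref_residue in enumerate(reference):
        observed = {ref_residue}
        observed.update(v[i] for v in variants if v[i] != gap)
        columns.append(observed)
    return columns


def is_allowed(columns, position, residue):
    """True when residue was observed at the given aligned position."""
    return residue in columns[position]


def stacked_view(columns):
    """Render each position as its sorted, non-redundant residues."""
    return ["".join(sorted(column)) for column in columns]
```

With the first string MKTA and the additional strings MRTA and MKSA, stacked_view() yields M, KR, ST, and A for the four positions, so an arginine at the second position is allowable while an aspartate is not.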

Embodiment III: a computer-implemented method includes identifying, in a nucleotide string, at least two exons, and at least one intron between the at least two exons, and a promoter sequence, selecting, within the nucleotide string, a cryptic promoter site including a sequence of nucleotides resembling the promoter sequence, associating a score to the cryptic promoter site based on a similarity score between the cryptic promoter site and the promoter sequence, and graphically marking, in a display for a user, the nucleotide string at a location indicative of the cryptic promoter site when the score is higher than a pre-selected threshold.

Embodiment IV: a computer-implemented method includes identifying, in a nucleotide string, a poly-A addition site, wherein the poly-A addition site includes a poly-A site and a signal, selecting, within the nucleotide string, a cryptic poly-A site, the cryptic poly-A site including a sequence of nucleotides resembling at least one of the poly-A sites, associating a similarity score to the cryptic poly-A site based on a similarity between the cryptic poly-A site and a real poly-A site, and graphically marking, in a display for a user, the nucleotide string at a location indicative of the cryptic poly-A site when the similarity score is higher than a pre-selected threshold.

Embodiment V: a computer-implemented method including identifying a first nucleotide string corresponding to a functional non-coding RNA gene and aligning said first nucleotide string with at least one additional nucleotide string that specifies a functional variant of said ncRNA gene. The computer-implemented method includes identifying, at each nucleotide position within said additional nucleotide string, multiple variable nucleotides that appear in the at least one additional nucleotide string for each aligned location in the first nucleotide string and graphically marking, in a display for a user, a variable nucleotide as an allowable nucleotide at an aligned location in said first nucleotide string.

Embodiment VI: a computer-implemented method including identifying a first nucleotide string corresponding to a non-coding RNA gene and aligning said first nucleotide string with at least one additional nucleotide string that specifies a functional variant of said non-coding RNA gene. The computer-implemented method includes identifying, at each nucleotide position within said additional nucleotide string, multiple variable nucleotides that appear in the at least one additional nucleotide string for each aligned location in the first nucleotide string, and graphically marking, in a display for a user, a variable nucleotide as an allowable nucleotide at an aligned location in said first nucleotide string.

Embodiments I, II, III, IV, V, and VI may include any one of the below recited elements in any combination and number:

Element 1, further including identifying, in the nucleotide string, a first exon that lacks the acceptor and contains the donor, and identifying, in the first exon, an open reading frame between an initiator codon for a gene and the donor. Element 2, further including identifying, in the nucleotide string, a last exon that contains the acceptor and lacks the donor, and identifying an open reading frame between the acceptor and a terminator codon for a gene. Element 3, further including identifying, in the nucleotide string, a branch point within the intron, the branch point being associated with a splicing site of the nucleotide string to combine the two exons. Element 4, further including identifying, in a nucleotide string, a mutation, wherein the mutation includes a modification in at least one of the two exons, the intron, the acceptor or the donor, and optionally a branch point, and graphically marking, in the display for the user, the mutation in the nucleotide string. Element 5, further including identifying, within an exon or the intron, a splice enhancer including a binding site for a spliceosome enhancer factor that promotes a splicing of exons of a gene, wherein the gene includes at least a portion of the exon and the intron. Element 6, further including identifying, within an exon or the intron, a splice silencer site including a binding site for an inhibitor factor that suppresses a splicing of exons of a gene, wherein the gene includes at least a portion of the exon and the intron. Element 7, further including determining a deleteriousness score of a mutation of the true splice site or the cryptic splice site based on the similarity score. Element 8, further including determining the similarity score by executing instructions from an algorithm selected from a group consisting of a Shapiro-Senapathy algorithm, a MaxEntScan algorithm, and an NNSplice algorithm, stored in a memory.
Element 9, further including: identifying, in the nucleotide string, a cryptic exon that includes at least one cryptic acceptor and one cryptic donor, and optionally, an open reading frame, between the cryptic acceptor and the cryptic donor, when a cryptic splice site score is higher than a pre-selected threshold, and a length of the cryptic exon conforms to a pre-selected threshold; and optionally identifying a cryptic branch point upstream of the cryptic exon.

Element 10, further including identifying an amino acid that is different from an allowable amino acid as a disallowed amino acid at the aligned location. Element 11, wherein graphically marking the variable amino acids includes stacking a non-redundant amino acid at each position of the additional amino acid string in the functional protein. Element 12, further including graphically distinguishing, in the display for the user, the allowed amino acid and a disallowed amino acid at each aligned location. Element 13, further including: identifying, in a nucleotide string, a positive signature when the nucleotide string codes an allowed amino acid in the functional protein, and a negative signature when the nucleotide string codes a non-allowed amino acid in the functional protein; graphically marking a mutation of the nucleotide string on the positive signature and the negative signature; and optionally determining a deleterious effect of the mutation based on whether the mutation occurs within the positive signature or the negative signature. Element 14, further including graphically indicating a hydropathy of each variable amino acid at each aligned location. Element 15, further including identifying, in a nucleotide string coding a protein domain in the functional protein, a mutation leading to a disallowed amino acid; determining a mutated hydropathy signature of the protein domain based on a hydropathy of a mutated amino acid; determining a normal hydropathy signature of the protein domain based on a hydropathy of an allowed amino acid or a disallowed amino acid; determining a deleteriousness score for the mutation based on a difference between the mutated hydropathy signature of the protein domain and the normal hydropathy signature of the protein domain; and determining a deleteriousness score for the mutation based on whether a mutation occurs within a positive signature indicating no deleteriousness or a negative signature indicating a deleteriousness.
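
The hydropathy comparison of Element 15 may be sketched, without limitation, as follows. The disclosure does not name a hydropathy scale or a formula for the difference; this sketch assumes the Kyte-Doolittle scale, treats a signature as the list of per-residue hydropathy values of a domain, and takes the mean absolute difference between the two signatures as the score. The function names are hypothetical.

```python
# Assumption: Kyte-Doolittle hydropathy values; the disclosure names no scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_signature(domain):
    """Return the per-residue hydropathy values of an amino acid string."""
    return [KYTE_DOOLITTLE[residue] for residue in domain.upper()]


def hydropathy_difference_score(normal_domain, mutated_domain):
    """Mean absolute hydropathy difference between two equal-length domains.

    Assumption: substitutions only, so both domains have the same length.
    """
    if len(normal_domain) != len(mutated_domain):
        raise ValueError("domains must have the same length")
    if not normal_domain:
        raise ValueError("domains must not be empty")
    normal = hydropathy_signature(normal_domain)
    mutated = hydropathy_signature(mutated_domain)
    total = sum(abs(n - m) for n, m in zip(normal, mutated))
    return total / len(normal)
```

Under these assumptions, replacing isoleucine (4.5) with aspartate (-3.5) in a three-residue domain AIL changes one position by 8.0, giving a score of 8.0/3, whereas an unchanged domain scores 0.0. A sliding-window average or another scale could be substituted without changing the overall flow.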

Element 16, wherein identifying the promoter sequence includes identifying at least one of a TATA box, a CAAT box, a GC box, and an initiator box. Element 17, wherein the similarity score includes a combination of one or more of a TATA box, a CAAT box, a GC box, and an initiator box. Element 18, wherein identifying the promoter sequence includes identifying a TATA box, CAAT box, GC box, and initiator box, and, in addition, enhancers and silencers. Element 19, further including determining the similarity score by executing instructions from an algorithm selected from a group consisting of a Shapiro-Senapathy algorithm, a MaxEntScan algorithm, and an NNSplice algorithm, stored in a memory. Element 20, wherein identifying a poly-A site includes identifying a signal that includes a nucleotide sequence that signals an appearance of the poly-A site near the signal. Element 21, further including graphically marking, in the display for the user, a real poly-A site. Element 22, further including determining the similarity score by executing instructions from an algorithm selected from a group consisting of a Shapiro-Senapathy algorithm, a MaxEntScan algorithm, and an NNSplice algorithm, stored in a memory. Element 23, further including identifying, in the nucleotide string, an enhancer of a poly-A site. Element 24, further including identifying, in the nucleotide string, a silencer of a poly-A site.

Element 25, further comprising identifying the different types of ncRNA genes in the dark matter genome using known ncRNA gene prediction algorithms and proprietary algorithms, and, further, multiple algorithms for each ncRNA type so as to discover most of the genuine genes. Element 26, further comprising taking variable AA strings from different individuals of the same organism, such as the human, and constructing the allowable (positive) and non-allowable (negative) signatures. Element 27, further comprising taking variable AA strings from different individuals of the same organism, such as the human, and discovering new domains by the presence of highly variable, invariable, and low variable AAs similar to and characteristic of genuine domains. Element 28, further comprising determining a distinct PWM or variable sequence signature for each of the splicing elements, such as a donor, or other regulatory or splicing elements, based on the multiple sequence alignment of genome sequences of numerous individuals from the same organism. Element 29, further comprising predicting novel promoters, binding sites, or other regulatory and splicing elements, from the PWM and MSA of genome sequences of numerous individuals, wherein the binding sites show less variance compared to other positions, or other statistically distinct characteristics, and determining mutations within these novel binding sites. Element 30, further comprising creating a database of all these novel elements from the genome of an organism. Element 31, further comprising determining that the invariance of the AA directly correlates with the deleteriousness of a mutation, indicating that the mutation at an invariant AA position is the most deleterious, with decreasing deleteriousness correlating with increasing amino acid variability, and applying this to determine the deleteriousness of a patient mutation.
Element 32, further comprising identifying a nucleotide that is different from an allowable nucleotide as a disallowed nucleotide at the aligned location. Element 33, wherein graphically marking the variable nucleotides comprises stacking a non-redundant nucleotide at each position of the additional nucleotide string in the functional ncRNA. Element 34, further comprising graphically distinguishing, in the display for the user, the allowed and a disallowed nucleotide at each aligned location. Element 35, further comprising: identifying, in a nucleotide string, a positive signature, and a negative signature from the allowed and disallowed nucleotides; graphically marking a mutation of the nucleotide string on the positive signature and the negative signature; and optionally determining a deleterious effect of the mutation based on whether the mutation occurs within the positive signature or the negative signature. Element 36, further comprising displaying the mutations in each of the genetic elements in each of the non-coding RNA genes, on the gene structure, depicting the processing steps of the ncRNA gene into the active element, and additionally elaborating these features in a sequence view, indicating the steps at which the processing error occurs.

Element 37, further including identifying a nucleotide that is different from an allowable nucleotide as a disallowed nucleotide at the aligned location. Element 38, wherein graphically marking the variable nucleotides includes stacking a non-redundant nucleotide at each position of the additional nucleotide string in the non-coding RNA gene. Element 39, further including graphically distinguishing, in the display for the user, the allowable nucleotide and a disallowed nucleotide at each aligned location. Element 40, further including identifying, in a nucleotide string, a positive signature, and a negative signature from the allowable nucleotide and a disallowed nucleotide; graphically marking a mutation of the nucleotide string on the positive (allowed) signature and the negative (disallowed) signature; and optionally determining a deleterious effect of the mutation based on whether the mutation occurs within the positive signature or the negative signature. Element 41, further including identifying a recognition sequence element in each of the non-coding RNA genes by using instructions contained in algorithms such as Shapiro-Senapathy, NNSplice, MaxEntScan, or modified versions thereof; optionally, displaying the recognition sequence element on a gene structure, depicting the non-coding RNA gene into an active element, and additionally elaborating these features in a sequence view; and indicating a position of a sequence error. Element 42, further including displaying a mutation in the non-coding RNA gene; depicting the non-coding RNA gene in an active element; elaborating a sequence view; and indicating an error position in the sequence view. Element 43, further including taking variable AA strings from different individuals of a same organism, and constructing an allowable signature and a non-allowable signature.
Element 44, further including taking variable AA strings from different individuals of a same organism; and discovering new domains by at least one of highly variable, invariable, and low variable AAs similar to and characteristic of genuine domains, discarding random nucleotide (the four bases) sites that indicate non-functional regions. Element 45, further including determining a distinct PWM or variable sequence signature for a splicing element, such as a donor, or other regulatory or splicing element, based on a multiple sequence alignment of gene or genome sequences of numerous individuals from a same species or a group of organisms consisting of similar species. Element 46, further including predicting novel promoters, binding sites or other regulatory and splicing elements, from a PWM and an MSA of a gene sequence for multiple individuals, wherein the binding sites show a mixture of low, medium, and high variance compared to other random nucleotide positions, or other statistically distinct characteristics indicative of functional regions, and determining mutations within these novel binding sites. Element 47, further including creating a database of many of these novel elements from a genome of an organism. Element 48, further including correlating an invariance or a degree of variance of an AA pair combination with a deleteriousness of a mutation, indicating that the mutation at an invariant AA position is highly deleterious, with a decreasing deleteriousness correlating with increasing amino acid variability; and applying this to determine the deleteriousness of a patient mutation.
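
The determination of a PWM from a multiple sequence alignment, as recited in Elements 28 and 45, and the use of per-position variability, as in Elements 29 and 46, may be sketched in Python as follows. This is an illustrative sketch only: it assumes the sequences of the same element from several individuals are already aligned to equal length and contain only A, C, G, and T; it uses Shannon entropy as the variability measure, which the disclosure does not specify; and the function names and the optional pseudocount are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_weight_matrix(aligned, pseudocount=0.0):
    """Return one {base: frequency} dict per column of the alignment.

    Assumes equal-length sequences containing only A, C, G, and T.
    """
    if not aligned:
        raise ValueError("at least one aligned sequence is required")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    pwm = []
    for i in range(length):
        counts = Counter(seq[i].upper() for seq in aligned)
        total = sum(counts[b] for b in BASES) + 4 * pseudocount
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm


def column_entropy(column):
    """Shannon entropy in bits; 0 for an invariant column, 2 for uniform."""
    return -sum(p * math.log2(p) for p in column.values() if p > 0)


def low_variance_positions(pwm, max_entropy=1e-9):
    """Indices of columns whose entropy does not exceed max_entropy."""
    return [i for i, col in enumerate(pwm) if column_entropy(col) <= max_entropy]
```

For the four aligned strings GTAAGT, GTGAGT, GTAAGA, and GTAAGT, the third column has frequencies of 0.75 for A and 0.25 for G, and low_variance_positions() returns positions 0, 1, 3, and 4 as invariant. Raising max_entropy would admit columns of low but nonzero variability.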

In one aspect, a method may be an operation, an instruction, or a function and vice versa. In one aspect, a claim may be amended to include some words (e.g., instructions, operations, functions, or components) recited in other one or more claims, one or more words, one or more sentences, one or more phrases, one or more paragraphs, and/or one or more claims.

To illustrate the interchangeability of hardware and software, items such as the various illustrative blocks, modules, components, methods, operations, instructions, and algorithms have been described generally in terms of their functionality. Whether such functionality is implemented as hardware, software, or a combination of hardware and software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application.

As used herein, the phrase “at least one of” preceding a series of items, with the terms “and” or “or” to separate any of the items, modifies the list as a whole, rather than each member of the list (e.g., each item). The phrase “at least one of” does not require selection of at least one item; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, the phrases “at least one of A, B, and C” or “at least one of A, B, or C” each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.

The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, another configuration, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof and alike are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology. A disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations. A disclosure relating to such phrase(s) may provide one or more examples. A phrase such as an aspect or some aspects may refer to one or more aspects and vice versa, and this applies similarly to other foregoing phrases.

A reference to an element in the singular is not intended to mean “one and only one” unless specifically stated, but rather “one or more.” Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. The term “some” refers to one or more. Underlined and/or italicized headings and subheadings are used for convenience only, do not limit the subject technology, and are not referred to in connection with the interpretation of the description of the subject technology. Relational terms such as first and second and the like may be used to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. All structural and functional equivalents to the elements of the various configurations described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and intended to be encompassed by the subject technology. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the above description. No claim element is to be construed under the provisions of 35 U.S.C. § 112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.”

While this specification contains many specifics, these should not be construed as limitations on the scope of what may be described, but rather as descriptions of particular implementations of the subject matter. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially described as such, one or more features from a described combination can in some cases be excised from the combination, and the described combination may be directed to a subcombination or variation of a subcombination.

The subject matter of this specification has been described in terms of particular aspects, but other aspects can be implemented and are within the scope of the following claims. For example, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. The actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the aspects described above should not be understood as requiring such separation in all aspects, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

The title, background, brief description of the drawings, abstract, and drawings are hereby incorporated into the disclosure and are provided as illustrative examples of the disclosure, not as restrictive descriptions. It is submitted with the understanding that they will not be used to limit the scope or meaning of the claims. In addition, in the detailed description, it can be seen that the description provides illustrative examples and the various features are grouped together in various implementations for the purpose of streamlining the disclosure. The method of disclosure is not to be interpreted as reflecting an intention that the described subject matter requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive subject matter lies in less than all features of a single disclosed configuration or operation. The claims are hereby incorporated into the detailed description, with each claim standing on its own as a separately described subject matter.

The claims are not intended to be limited to the aspects described herein, but are to be accorded the full scope consistent with the language of the claims and to encompass all legal equivalents. Notwithstanding, none of the claims are intended to embrace subject matter that fails to satisfy the requirements of the applicable patent law, nor should they be interpreted in such a way.

What is claimed is:
 1. A computer-implemented method comprising: receiving a nucleotide string comprising a plurality of nucleotides from at least a portion of two or more individuals' genome, wherein the portion of the genome includes at least one genetic element of: a 5′-UTR, a promoter, an enhancer, a silencer, an exon, an intron, a coding sequence, a non-protein coding RNA, a splice acceptor, a splice donor, a branch point site, a 3′-UTR, a Kozak sequence, a poly-A addition site or signal, or a cryptic version thereof, from a known protein coding gene or a regulatory, splicing, or functional element of a non-protein coding RNA gene, and within genes not yet identified in a Dark Matter genome; identifying, in the nucleotide string based on a chromosomal position, a genetic element such as a coding element, exon, intron, 5′-UTR, 3′-UTR, promoter, a splice acceptor, a splice donor, a branch point site, a Kozak sequence, a poly-A addition site or signal, an enhancer, or a silencer, or their cryptic version thereof, from a known protein coding gene or a regulatory, splicing, or functional element of the non-protein coding RNA gene; and, determining a variable sequence signature or position weight matrix (PWM) for a particular genetic element of a particular gene, based on a multiple sequence alignment of the same element at the same chromosomal or genomic position within a gene or in the genome sequences of one or more individuals from a same species.
 2. The computer-implemented method of claim 1, further comprising: determining a variable sequence signature or position weight matrix (PWM) based on di, tri, or longer oligonucleotides, for a particular genetic element of a particular gene, based on a multiple sequence alignment of the same element at the same chromosomal or genomic position within a gene or in the genome sequences of one or more individuals from a same species.
 3. The computer-implemented method of claim 1, further comprising: predicting novel genetic elements such as promoters, recognition sequences, binding sites or regulatory and splicing elements throughout the genome, based on the PWM constructed from a multiple sequence alignment of a nucleotide sequence at a particular chromosomal position in genomes from multiple individuals of a same species or organism, wherein the novel elements show variable nucleotide frequencies that exhibit non-random characteristics indicative of the PWM of genuine structural or functional genetic elements, or statistically distinct characteristics indicative of functional regions, compared to random nucleotide positions.
 4. The computer-implemented method of claim 1, further comprising: predicting novel genetic elements such as promoters, recognition sequences, binding sites or other regulatory and splicing elements, based on the PWMs constructed from a set length identified from a multiple sequence alignment of a nucleotide sequence at a chromosomal position in the genomes from multiple individuals of a same species or organism, at every consecutive position in a genome, wherein the PWMs show variable nucleotide frequencies that exhibit non-random characteristics, typical of PWMs of genuine functional genetic elements, or other statistically distinct characteristics indicative of functional regions, compared to random nucleotide positions.
 5. The computer-implemented method of claim 1, further comprising: modifying a Shapiro Senapathy algorithm, a MaxEntScan algorithm, a NNSplice algorithm, or any algorithm for identifying the genetic elements based on the PWM or variable sequence signature for the genetic element constructed from mono, di, tri, or longer oligo-nucleotides; assigning a score to the structural or functional genetic element; and, identifying deleterious or strength altering mutations in the functional genetic element based on similarity scores calculated from a modified algorithm such as a Shapiro Senapathy algorithm, MaxEntScan algorithm, NNSplice algorithm, or any algorithm based on di, tri, or longer oligo-nucleotides.
 6. The computer-implemented method of claim 1, further comprising: aligning the plurality of nucleotides from unknown regions such as multiple protein binding promoter sites, polyA sites, splice sites upstream of, downstream of, or within or around the gene from a number of individuals from a same species or organism, so as to create a recognizable pattern of the PWM consisting of invariable or variable nucleotides that are not randomly distributed at a given position as having structural, functional or biological implications of gene regulatory or splicing elements.
 7. A computer-implemented method comprising: identifying, in a nucleotide string, based on a chromosomal position, a non-coding RNA gene such as a miRNA, tRNA, rRNA, or snoRNA; determining, based on a similarity score using a prediction algorithm, a genetic element comprising a regulatory, splicing, or a functional RNA element of the non-coding RNA gene; identifying, a difference between the first similarity score of a normal genetic element and the second similarity score of a mutated genetic element of the non-coding RNA gene; determining the causality of a phenotype by a sequence variant based on the difference between the first similarity score and the second similarity score; and, graphically marking, in a display for a user, the nucleotide string at a location indicative of an exon, an intron, regulatory, splicing or functional RNA element when the first similarity score or the second similarity score is higher or lower than, or equal to, a pre-selected threshold on a gene structure or sequence view.
 8. The computer-implemented method of claim 7, further comprising, identifying, in the nucleotide string, a positive signature, and a negative signature from an allowable mono, di, tri or longer oligo-nucleotide and a disallowed mono, di, tri or longer oligo-nucleotide; graphically marking a mutation of the nucleotide string on the positive signature and the negative signature; and, determining a deleterious effect of the mutation based on whether the mutation occurs within the positive signature or the negative signature.
 9. The computer-implemented method of claim 7, further comprising, displaying a recognition sequence, regulatory, splicing, or processed functional element on the gene structure; depicting the processing steps of the non-coding RNA gene into an active element; elaborating the processing steps in the gene structure or sequence view; indicating mutations and the processing steps at which a processing error occurs; and, elaborating on a mechanism of aberrations within the ncRNA gene causing a biological or clinical phenotype.
 10. The computer-implemented method of claim 7, further comprising: constructing a position weight matrix (PWM) for regulatory, or splicing elements, and recognition sequences for the processing of a non-coding RNA gene; and constructing the PWM for a processed functional non-coding RNA gene product, for an individual type of the non-coding RNA gene such as the miRNA, tRNA or rRNA; and constructing the PWM for the non-coding RNA gene, by aligning the nucleotide sequences of a particular non-coding RNA gene from a number of individuals of the same organism at a particular chromosomal position.
 11. The computer-implemented method of claim 7, further comprising: constructing a variable sequence signature based on a number or frequency of variable mono, di, tri or longer oligonucleotides at each position of the genetic element from aligned sequences; and, determining a deleteriousness score of a mutation of the non-coding RNA gene, or regulatory, splicing, recognition sequence elements, or a processed functional non-coding RNA product, based on the difference between the first similarity score of a normal genetic element and the second similarity score of the mutated element, calculated from the position weight matrix (PWM).
 12. The computer-implemented method of claim 7, further comprising: determining the similarity score by executing instructions from an algorithm selected from a group consisting of algorithms such as Shapiro-Senapathy algorithm, a MaxEntScan algorithm, and NNSplice algorithm, stored in a memory; determining the similarity score by executing instructions from a modified algorithm selected from a group consisting of algorithms such as Shapiro-Senapathy algorithm, a MaxEntScan algorithm, and NNSplice algorithm, stored in a memory, based on characteristics of splicing element sequence signals such as length or variability; and, determining a combined score of the group of algorithms based on corresponding average or differentially weighted scores.
 13. The computer-implemented method of claim 7, further comprising: determining and graphically marking variable nucleotides by stacking a non-redundant mono, di, tri or longer oligo-nucleotides at each position of an additional nucleotide string from a multiple sequence alignment of multiple non-coding RNA genes of the same type; and, applying a position weight matrix (PWM) methodology using these oligo-nucleotides for detecting the multiple non-coding RNA genes and corresponding genetic elements.
 14. A computer-implemented method comprising: identifying a first amino acid string corresponding to a functional protein or a protein domain; aligning said first amino acid string with at least one additional amino acid string that encodes a functional variant of said functional protein; identifying, at each amino acid position within said additional amino acid string, multiple variable amino acids that appear in the at least one additional amino acid string for each aligned location in the first amino acid string; and graphically marking, in a display for a user, a variable amino acid as an allowable amino acid at an aligned location in said first amino acid string.
 15. The computer-implemented method of claim 14, further comprising: identifying an amino acid that is different from an allowable amino acid as a disallowed amino acid at the aligned location; graphically stacking a non-redundant disallowed amino acid as a variable amino acid at each position of the additional amino acid string in the functional protein; and graphically distinguishing, in the display for a user, an allowed amino acid and the disallowed amino acid at each aligned location.
 16. The computer-implemented method of claim 14, further comprising: distinguishing allowed variable amino acids of a protein or a domain as a positive signature, and disallowed variable amino acids of the protein or the domain as a negative signature; determining a deleterious effect of a mutation based on whether the mutation occurs within the positive signature or negative signature; and graphically marking a mutation on the positive signature or the negative signature.
 17. The computer-implemented method of claim 14, further comprising: graphically indicating a hydropathy value of each variable amino acid at each aligned location; determining a hydropathy value based on the average of hydropathy values of each of the variable amino acids at each location; determining a hydropathy value for a region of amino acids based on the average of hydropathy values at each amino acid position in a given amino acid sequence region; determining a normal hydropathy signature of a protein domain based on the hydropathy value of an allowed amino acid; determining a mutated hydropathy signature of a sequence portion of a protein domain based on the hydropathy value of a mutated amino acid; and determining a deleteriousness score for the mutation based on a difference between the normal hydropathy signature and the mutated hydropathy signature, or an average hydropathy index of the plurality of variable amino acids of a mutated position before and after mutation.
 18. The computer-implemented method of claim 14, further comprising: correlating an invariance or a degree of variance of an amino acid position with a deleteriousness of a mutation; indicating that the mutation at an invariant amino acid position is deleterious, wherein decreasing deleteriousness is correlated with increasing amino acid variability; and applying the correlated invariance or degree of variance to determine the deleteriousness of the mutation.
 19. The computer-implemented method of claim 14, further comprising: constructing an allowable and a non-allowable variable amino acid sequence signature based on variable amino acid strings of a protein or domain sequence at the same chromosomal position from different individuals of a same organism; determining a frequency of each allowable amino acid at every position that occurs across the different individuals; defining an algorithm based on the frequencies of different amino acids to assign scores for individual allowable amino acids at each position; determining the deleteriousness of a variable amino acid at a position based on the frequency of an allowable amino acid, with deleteriousness decreasing with increasing variability score.
 20. The computer-implemented method of claim 14, further comprising, aligning a genome sequence from an individual of an organism with the genome of another individual of the same organism, at the same chromosomal or genomic position, to construct variable mono, di, tri or oligonucleotides, or mono, di, tri or oligo amino acids; and predicting, based on aligning the genome sequence of multiple individuals of the same organism, regulatory elements, splicing elements, variable amino acids, domains, proteins, genes, exons, introns, or intergenic regions, throughout the genome.
 21. The computer-implemented method of claim 14, further comprising: constructing a variable amino acid sequence signature based on variable amino acid strings of the plurality of variable amino acids representing a possible domain or portion of the possible domain, in open reading frames (ORFs), exonic, intronic, intergenic, or genic regions throughout the genome, at the same chromosomal or genomic position, from different individuals of a same organism; and identifying new genes, exons, introns, coding sequence, regulatory or splicing elements, domains, or proteins, protein coding genes or non coding RNA genes, based on portions of variable amino acid sequence signature by comparing with genes predicted within the genome, employing gene prediction programs using various parameters.
 22. The computer-implemented method of claim 14, further comprising: constructing a variable amino acid sequence signature based on variable amino acid strings of the plurality of variable amino acids representing a possible domain or portion of the possible domain, in open reading frames (ORFs), exonic, intronic, intergenic, or genic regions of the genome, at the same chromosomal or genomic position, from different individuals of a same organism; and discovering new domains by the presence of the variable amino acid sequence signature similar to and characteristic of variable sequence signatures of genuine domains, by searching in all six reading frames of a nucleotide sequence throughout the genome from different individuals of the same organism.
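
By way of non-limiting illustration, the position weight matrix (PWM) of claims 1 and 10 and the score difference of claims 7 and 11 may be sketched as follows. The sketch assumes gap-free aligned sequences of equal length, a uniform background frequency, a pseudocount of 1, and a log-odds scoring scheme. None of these choices is specified by the claims, and the function names `build_pwm`, `score`, and `score_difference` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pwm(aligned, pseudocount=1.0):
    """Build per-position log2-odds weights from equal-length aligned sequences.

    Assumptions (not taken from the claims): gap-free columns, a uniform
    background frequency, and an additive pseudocount.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(ALPHABET)
    total = len(aligned) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in ALPHABET
        })
    return pwm


def score(pwm, seq):
    """Similarity score: sum of the PWM weights of the bases in seq."""
    return sum(column[base] for column, base in zip(pwm, seq))


def score_difference(pwm, normal, mutated):
    """First similarity score (normal) minus second similarity score (mutated)."""
    return score(pwm, normal) - score(pwm, mutated)


# Hypothetical aligned copies of one element from four individuals.
aligned = ["CAGGTAAGT", "CAGGTGAGT", "AAGGTAAGT", "CAGGTAAGA"]
pwm = build_pwm(aligned)
delta = score_difference(pwm, "CAGGTAAGT", "CAGATAAGT")
```

In this sketch, a positive `delta` indicates that the mutated sequence matches the aligned element less well than the normal sequence; the threshold at which such a difference is treated as deleterious is left to the implementation.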
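
By way of non-limiting illustration, the hydropathy-based deleteriousness score of claim 17 may be sketched as follows. The claims do not name a hydropathy scale; the Kyte-Doolittle scale is assumed here. The use of an absolute difference as the score, and the function names `position_hydropathy`, `hydropathy_signature`, and `hydropathy_deleteriousness`, are likewise assumptions made for the example.

```python
# Kyte-Doolittle hydropathy index (an assumed scale; the claims name none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def position_hydropathy(residues):
    """Average hydropathy of the variable amino acids allowed at one position."""
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


def hydropathy_signature(allowed_columns):
    """Normal hydropathy signature: one averaged value per aligned position."""
    return [position_hydropathy(column) for column in allowed_columns]


def hydropathy_deleteriousness(allowed_columns, position, mutant):
    """Absolute difference between the normal value and the mutant's value."""
    normal = position_hydropathy(allowed_columns[position])
    return abs(normal - KYTE_DOOLITTLE[mutant])


# Hypothetical allowed amino acids at two aligned positions of a domain.
allowed = ["LIV", "KR"]
signature = hydropathy_signature(allowed)
charged = hydropathy_deleteriousness(allowed, 0, "D")
conservative = hydropathy_deleteriousness(allowed, 0, "V")
```

Under these assumptions, replacing a hydrophobic position with aspartate yields a larger score than replacing it with valine, which is one of the amino acids already allowed at that position.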
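
By way of non-limiting illustration, the six-reading-frame search recited in claim 22 may begin with a translation step such as the one sketched below. The sketch assumes the standard genetic code and an uppercase or lowercase DNA string over A, C, G, and T; codons containing other characters are translated as "X". It covers only the translation into six frames, not the subsequent comparison against variable amino acid sequence signatures. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each translated frame from each individual could then be aligned position by position, so that variable amino acid columns can be compared with the signatures of known domains.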